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Abstract 


The influence of recent developments in supercomputing on computa- 
tional chemistry is discussed with particular reference to Cray computers and 
their pipelined vector /limited parallel architectures. After reviewing Cray hard- 
ware and software we examine the performance of different elementary program 
structures and outline effective methods for improving program performance. We 
then discuss the computational strategies appropriate for obtaining optimum per- 
formance in applications to quantum chemistry and dynamics. Finally, some dis- 
cussion is given of new developments and future hardware and software improve- 
ments. 
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I. Introduction 

The advent of supercomputers has had a profound influence on the devel- 
opment of computational chemistry in the last decade. Any increase in comput- 
ing power from a given level will, of course, increase the range and size of prob- 
lems that can be studied, but the influence of supercomputers goes much deeper 
than this. The need to develop new algorithms to exploit supercomputers fully 
affects formal mathematical aspects of computational chemistry methodology, 
while in some areas the ability to perform calculations at new levels of accuracy 
or on new chemical systems can alter entire computational chemistry strategies. 

In the present review, our goal is to show how the supercomputers produced by 
Cray Research Inc. (CRI) have influenced two areas of computational chemistry 
— molecular electronic structure and dynamics calculations. We shall discuss as- 
pects of performance and programming for a range of Cray computers, and show 
how these factors have influenced the methodology and implementation in these 
areas. We shall also discuss how the results obtained from calculations have in 
turn influenced the philosophy behind the calculations. 

At the time of writing, several different supercomputer models are pro- 
duced by CRI. All are parallel, pipelined vector computers. The CRAY X-MP 
series is a development of the original CRAY-1: the machines are equipped with 
one to four CPUs, up to 16 mega, words (MW) of very fast memory (up to 64 MW 
in the latest models) with multiple channels to each CPU, and a clock period of 
8.5 ns (10 ns on the “se” series). The X-MP machines are characterized by excel- 
lent scalar performance and a rather rapid convergence to their asymptotic per- 
formance limit, that is, good performance on arithmetic involving short vectors. 
For example, they achieve over half their asymptotic performance of 235 million 
floating-point operations per second (MFLOPS) in multiplication of matrices of 
order 11 x 11. The CRAY Y-MP is a natural successor to the X-MP, with a 
clock period of 6 ns and up to 32 MW of memory. Its performance characteristics 
are very similar to those of the X-MP. The CRAY-2 is a rather different devel- 
opment, compared to the X-MP, from the original CRAY-1. The clock period is 
only 4.1 ns, and the machine can be equipped with up to 512 MW of relatively 
slow memory. As a consequence, at well over 400 MFLOPS the asymptotic per- 
formance is even higher than the Y-MP, but the performance on scalar or short 
vector arithmetic is not as good, proportionally, as on the X-MP series. This dif- 
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ference should not be overemphasized, however: the CRAY-2 still reaches half its 
asymptotic performance with the multiplication of matrices of order 25 x 25. In 
this sense all the Cray computers can be regarded as generally similar in achiev- 
ing excellent performance without the need to go to long vector lengths. They are 
thus quite different from supercomputers like the CDC CYBER 205, or many of 
those from Japanese manufacturers. 

Obtaining high performance from Cray computers requires the proper ex- 
ploitation of parallelism in codes to use multiple functional units, pipelined vector 
hardware, and (where available) multiple CPUs. We can consider parallelism as 
arising at several levels. At the lowest level is the possibility of parallel execution 
of instructions derived from a single program statement, using the multiple func- 
tional units. At the next level is vectorization, the “parallel” execution of loop 
iterations using the vector functional units. From the programming point of view 
the fact this is not strictly parallel but pipelined execution is usually irrelevant. It 
is also possible to execute different loop iterations (or groups of iterations) on dif- 
ferent CPUs. This next level of parallelism, termed “microtasking” by CRI [1,2], 
is more elaborate than the lower levels, as code must be included to acquire, re- 
lease and possibly synchronize other CPUs. An even higher level of parallelism 
is the execution of larger code fragments (individual subroutines, say) on several 
CPUs simultaneously. This approach, referred to as “macrot asking” by CRI [1,2], 
may involve execution of quite different tasks on different CPUs, whereas micro- 
tasking would usually have ax.ll CPUs executing the same instructions but with 
different data. Finally, the highest level of parallelism corresponds to running sep- 
arate user jobs on the various CPUs: this is conventional multiprocessing at the 
operating system level. We shall be particularly concerned with the implementa- 
tion of parallelism a.t these various levels in computational chemistry codes in this 
review. 

Our aim here is to explain how various methods of computational chem- 
istry can be formulated to take maximum advantage of the power of Cray super- 
computers. In order to provide the necessary background for this we discuss the 
characteristics and performance of different Cray computers, and then we con- 
sider a number of computational chemistry activities in some detail. Our empha- 
sis will be on the utilization of Cray computer parallelism and we assume readers 
are already familiar with the formalism of ah initio electronic structure method- 
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ology [3] or of classical and quantum mechanical scattering [4-6]. We should also 
note here that as we routinely perform calculations using computers other than 
those from CRI, we generally avoid programming techniques that are very ma- 
chine specific. In particular, we eschew the use of assembly language, and we gen- 
erally attempt to implement algorithms that will be reasonably efficient on other 
supercomputer architectures. Some examples of algorithm choice motivated in 
part by these portability considerations are presented in our discussion. 

In the next section, we discuss the various Cray computers, both hard- 
ware and software. In section III we describe the performance characteristics of 
the different machines for some typical code fragments, and deduce from these 
results the appropriate techniques for obtaining maximum performance. In sec- 
tions IV and V we discuss the implementation of these techniques in various 
steps of molecular electronic structure calculations and dynamics calculations, 
respectively. These sections also review the ways in which supercomputers have 
influenced different aspects of computational chemistry. Section VI comprises 
our conclusions; we also speculate on the role forthcoming machines, such as the 
CRAY-3, will play in computational chemistry. 



II. Cray computers 
A. Hardware 

We shall discuss the performance and use of the CRAY X-MP, 

CRAY Y-MP, and CRAY-2 computers. As stated above, these are pipelined vec- 
tor processors with multiple high-speed CPUs. Vector arithmetic is performed 
on operands held in eight 64-element 64-bit vector registers. Results from one 
floating-point unit can be passed to other floating-point units while being re- 
turned to the registers on the X-MP and Y-MP: this is referred to as “chain- 
ing”. This feature is not available on the CRAY-2. Our discussion of the X-MP 
is concerned primarily with a four CPU eight MW X-MP/48, an older ma- 
chine with a clock period of 9.5 ns as opposed to the 8.5 ns of the latest X-MPs. 
This X-MP/48 features hardware GATHER/SCATTER and is configured with 
an eight MW input/output processor and a 128 MW solid-state storage de- 
vice (SSD). The main memory is divided into 32 banks, and has a memory access 
time of eleven clock periods to load a single word into a CPU register. Successive 
banks can deliver words to the registers in successive clock periods, but each bank 
is busy for four clock periods after initiation of a read request. Each CPU has 
four channels to memory: two read, one write and one input/output (I/O). We 
shall also discuss some experience with an X-MP/14se , with a single CPU (clock 
period 10 ns) and four MW of memory. Comparisons between the performance of 
the X-MP/48 and the X-MP/14se are given below. 

We shall discuss the performance of two slightly different CRAY-2 
computers. Both are four CPU machines with 256 MW of main memory and a 
clock period of 4.1 ns. Each CPU has a single channel to memory, and 16 kilo- 
words (KW) of very fast “local memory” that can be used as a type of cache to 
enhance performance. However, explicit control over local memory is available 
only to the assembly-language programmer, although high-level language compil- 
ers are being extended to generate code to utilize this hardware, as are some li- 
brary routines. The CRAY-2 main memory is divided into 128 banks, but a soft- 
ware technique (“pseudo- double banking”) provides emulation of 256 banks. Like 
the X-MP, successive banks can deliver words in successive clock periods. How- 
ever, the CRAY-2 memory is, in effect, divided into quadrants, and in any partic- 
ular clock period a given CPU can access only one of the quadrants. This causes 
significant complications in a multi-user environment, as when a user job recovers 
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a CPU and starts to issue memory requests it may well find the requests “out of 
phase” with the CPU quadrant access. Even in dedicated mode it is possible for a 
job to be up to three clock periods out of phase with memory. The two CRAY-2s 
discussed differ in that the older machine has a bank busy time of 57 clock peri- 
ods, while the newer machine (referred to below as CRAY-2*) has a somewhat 
reduced bank busy time of 42 clock periods. Clearly, there is a very significant 
difference between the memory performance on the CRAY-2 and on the X-MP 
and this difference appears in almost all performance comparisons, as we will see 
below. We shall also present results obtained on a CRAY Y-MP, equipped with 
8 CPUs (6 ns clock period), 32 MW of memory (256 banks) and a 256 MW SSD. 
As expected, the Y-MP behaves very like the X-MP but scaled in performance 
by the faster cycle time. 

Various disk subsystems are available from CRI; the machines to which 
we have access are largely configured with DD49 disk drives, with 150 MW of 
storage per drive and a data transfer rate on the order of 1 MW per second 
(MW/s) for a single unit. The X-MP and Y-MP have input /output (I/O) pro- 
cessors, essentially a large memory (8 MW on our X-MP/48, 32 MW on our 
X-MP/14se) buffer between CPU and disk. The transfer rate between the I/O 
processor and CPU is more than ten times faster than the transfer rate to disk, 
and substantial improvements in I/O performance can be achieved by “striping” 
files across multiple disks, so that successive blocks are written to different drives. 
These blocks can be read into the I/O processor in parallel, and then the data 
can be transferred to the CPU. Even better I/O performance can be obtained by 
using the SSD, available for the X-MP and Y-MP machines. The X-MP/48 we 
discuss has two channels to the SSD, and each is capable of transferring data at 
over 150 MW/s. In addition, the SSD has no overheads associated with head po- 
sitioning and rotational delays, so that I/O can be performed with essentially no 
I/O wait time. 

In the latest system versions I/O processing on the CRAY-2 also uses 
disk striping, and the larger memory of the CRAY-2 allows large buffers to be 
used for read-ahead/write-behind on sequential system I/O. Further, by taking 
advantage of this large memory when programming sorting steps it will often be 
possible to sort in memory, instead of using direct access disk I/O. 
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B. Software 
1. Operating systems 

There are three operating systems in common use for Cray computers: 
COS, UNICOS and CTSS. COS is a batch-job-oriented system that runs on the 
X-MP (and the CRAY-1). UNICOS is an interactive system for the CRAY-1, 
X-MP, Y-MP, and CRAY-2, based on AT&T UNIX System V with significant 
extensions (including some from Berkeley UNIX). CTSS is another interactive 
system, based on the Livermore Time- Sharing System. We will not discuss CTSS 
further here, nor will we consider the Guest Operating System, which allows UNI- 
COS to be run on some CPUs of a multiprocessor system and COS on others. 

The X-MP/48 results we discuss are obtained under COS, while the X-MP/14se, 
Y-MP, and CRAY-2 results are all obtained under UNICOS. 

COS, as noted above, is a batch -job-oriented system: it has a fairly lim- 
ited repertoire of job control language (JCL) commands, but these nevertheless 
provide essentially all the functionality required to carry out large-scale scientific 
computations. In addition to compilers (discussed below) and loaders, COS [7] 
features program maintenance utilities, permanent file management and archiv- 
ing, and dynamic (transparent to the user) management of resources such as the 
SSD or scratch file space for local files. However, CRI is apparently committed 
to UNICOS as its recommended operating system, and support for COS is being 
gradually withdrawn. For example, new sites are normally installed with UNI- 
COS as the operating system. While this may have some advantages from the 
point of view of operating system maintenance (COS being written largely in 
Cray assembly language (CAL) and UNICOS largely in C), it is not clear how 
much benefit the end-user sees, especially the end-user interested mainly in large- 
scale scientific calculations. This issue is discussed at greater length below. 

UNICOS [8] features essentially a complete set of UNIX commands, in- 
cluding both the Bourne and C shells and the TCP/IP remote file transfer (FTP) 
and communications (TELNET) package. UNIX itself is rather deficient in as- 
pects of resource management and batch job execution (no queues, no scratch 
files, etc), and these deficiencies are being addressed as part of the development of 
UNICOS. For example, the Network Queueing System (NQS) [8] has been incor- 
porated into UNICOS in order to provide control over job execution and queues. 
However, UNICOS version 4.0 (the latest release at the time of writing) does not 
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yet provide as flexible a batch environment as COS, at least from the point of 
view of the production user, although future releases should improve this situa- 
tion. It is also unfortunate that at the present time the X-MP and CRAY— 2 run 
under slightly different dialects of UNIX, especially in view of the reputation of 
UNIX as a portable operating system. Presumably the differences will gradually 
be eliminated as UNICOS matures. 


It may be useful here to compare UNICOS and COS as operating sys- 
tems as they are seen by a user interested in large-scale scientific calculations. 
First, there is certainly some convenience in having interactive access to the ma- 
chine from the point of view of compiling, debugging, etc, although the delay in 
turnaround for small jobs in going through, say, a VAX station to X-MP COS 
should seldom be significant. Further, as many of the UNIX commands were de- 
vised to assist in program development there is a wider range of powerful software 
tools available in UNICOS than in COS. On the other hand, from the point of 
view of running production jobs, UNICOS (strictly, UNICOS plus NQS) in its 
present version (4.0) appears to be inferior to COS. The dearth of resource man- 
agement facilities in UNICOS (for example, the lack of any method for allocat- 
ing and managing scratch file space beyond user goodwill in CRAY-2 UNICOS) 
makes the COS environment much more convenient for the production user. Fur- 
ther, comparisons suggest that user job throughput on the same hardware can 
be considerably (several times) higher using COS. This is rela.ted in part to the 
rather poor I/O performance we have experienced under X-MP UNICOS, where 
sequential I/O performance has been severely impacted when several jobs are run- 
ning [9]. This affects CPU times as well as wall clock times, by a factor of four or 
more: we have observed a factor of ten increase relative to COS when I/O is done 
with the default file block size. Finally, at this time (using COS version 1.16) 
none of our production tasks reveal operating system bugs in COS that require 
workarounds or special tactics (the issue of compiler bugs is treated below), while 
even under UNICOS 4.0 certain codes will not operate correctly without special 
modifications. Again, this situation will no doubt improve as UNICOS matures, 
but at present our view is that computational chemists are better served by COS 
than by UNICOS. 
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2. FORTRAN compilers 

As discussed in the Introduction, our aim is to avoid the use of assembly 
language and to use a high-level language for our codes. The debate over FOR- 
TRAN’S deficiencies continues, but for most scientists the accumulation over a 
number of years of a considerable body of functioning software written in FOR- 
TRAN, together with the availability of optimizing compilers, makes it the high- 
level language of choice (see, e.g. Ref. 10). 

The X-MP and the CRAY-2 each offer two compilers invoked as CFT 
and CFT77. It is a little unfortunate in view of the nomenclature that the CFT 
compilers on the two machines are not the same: CRAY X-MP CFT [11] is a 
product line that goes back to the CR AY-1, and has been fully compatible with 
the FORTRAN 77 standard since 19S1 (version 1.10); the CRAY-2 compiler [12], 
whose product name is CFT2, is a FORTRAN 66-ba.sed compiler (again origi- 
nally derived from an early version of the CRAY-1 compiler) that is still not com- 
patible with the FORTRAN 77 standard. Both of these compilers are written in 
assembly language. CFT77 [13] is (in principle) the same product on the X-MP 
and the CRAY-2, and as its name suggests is fully compatible with the FOR- 
TRAN 77 standard. This compiler is written in Pascal, and as a consequence is 
much slower in compiling code than either version of CFT (sometimes by an or- 
der of magnitude or more), but this is only of consequence when large programs 
have to be completely recompiled. CRTs long-term commitment seems to be to 
CFT77: CFT is not available on the Y-MP, for example. 

Both the CFT and CFT77 compilers feature extensive optimization and 
vectorization capabilities, as would be expected; CFT77 also includes explicit ar- 
ray syntax following the FORTRAN SX proposal [10]. In addition, the newest 
release of CFT77 provides some capability for recognizing microtasking direc- 
tives [1,2] and generating code to run on multiple CPUs. The compiler optimiza- 
tion involves both local and global (critical path) techniques similar in their ef- 
fect to other advanced FORTRAN compilers. The compilers’ ability to vectorize 
code has improved substantially over the years, especially in the area of condi- 
tional statements in loops, or generation of code for vectorized and non-vectorized 
versions of loops with run-time decision as to which is correct to use. While a 
number of constructs still inhibit vectorization, the compilers are being improved 
constantly in this area, and several examples are discussed later. As is common 
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with vectorizing compilers, it may be necessary to inform the compiler explicitly 
that certain constructions should not inhibit vectorization, because from the code 
alone it may be impossible to tell whether vectorization should be allowed. In the 
loop 

DO 10 I = 1,11 
A(I+M) = A(I) 

10 CONTINUE 

the loop can be vectorized if M > N, but the compiler cannot check this at com- 
pile time, and would flag the loop as non-vectorizable. By including a directive 
informing the compiler that M > N always holds, that is, to ignore any potential 
addressing problems — “vector dependencies” — the user can ensure the loop 
will be vectorized. In principle, a test for such dependencies could also be gen- 
erated for execution at run time, but this facility is not currently available. The 
compiler directives [11-13] have the form of a FORTRAN comment statement, so 
this does not interfere with portability; however, while other manufacturers also 
employ such devices there is no standardization of directive names or syntax. 

For efficient use of Cray computers, it is imperative to make maximum 
use of data once it has been loaded into the vector registers. The compilers are 
still not sophisticated enough to relieve the programmer of this job entirely: for 
example, the unrolling of DO-loops [14] usually enhances code performance under 
CFT or CFT77, unlike, say, the Alliant FX/FORTRAN compiler, for which user 
unrolling of loops generally reduces performance. 

The latest improvements in CFT77 notwithstanding, FORTRAN sup- 
port for using multiple CPUs on the various Cray machines is still fairly prim- 
itive [1,2]. Microtasking requires special constructs to be inserted in the code, 
which is then passed through a pre-processor to generate small CAL routines to 
manage the multitasking. It is this pre-processing step that the latest version of 
CFT77 appears to be able to manage for itself. Macrotasking, which normally in- 
volves coarse-grained parallelism, is handled through a set of FORTRAN-callable 
routines to organize starting multiple tasks and synchronizing them. In the very 
earliest form of microtasking, it was intended that microtasked versions of var- 
ious library routines would be available, but this is not the case in the present 
release. This is a great pity, as a method of using multiple CPUs that was essen- 
tially transparent to the user (in particular, that required no non-standard code 
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modifications) would be very valuable, as would automatic compiler-generated 
parallel code for DO loops (especially outer DO loops where inner DO loops 
have been vectorized). These facilities are already present in compilers and li- 
braries for other machines, like the Alliant FX/8 [15,16], and will no doubt ap- 
pear in the CFT77 compiler as it is further developed. Microtasked libraries are 
also promised by CRI. In view of the trend to larger numbers of CPUs (8 on the 
Y-MP, 16 on the CRAY-3) it seems imperative to provide strong support for 
multitasking — as the number of CPUs grows it will become increasingly diffi- 
cult to keep a machine busy with individual jobs unless very sophisticated job 
schedulers are developed. 

One of the inevitable consequences of exploiting machine parallelism is 
an increase in the memory required for a given computational task. One of the 
“8X” extensions to FORTRAN is the concept of allocatable arrays [10], storage 
for which is allocated dynamically on entry to a subroutine and which is deas- 
signed (i.e. deallocated, at least in principle) on exit from the subroutine. This 
facility is available in recent versions of CFT77, but unfortunately the allocated 
storage is not returned on exit from the subroutine, so that the overall memory 
length can grow but not shrink using this approach. The use of allocatable ar- 
rays also incurs substantial overheads at present. More traditional methods of 
“dynamical allocation” are based on user- controlled expansion of the field length 
to adjust the size of blank common, which is conventionally loaded at the end of 
user memory. In this way memory can be acquired and returned as desired. How- 
ever, this approach reflects a fairly primitive philosophy — that of using a user 
stack and assuming that no other part of the job will alter the field length. This 
assumption is vitiated at the outset in COS, for example, where system buffers 
are allocated at the highest user addresses, “growing” downwards as more files are 
opened. Care is therefore required to ensure that the user’s expanding stack does 
not collide with the system. In general, with more sophisticated memory manage- 
ment features becoming available (like the heap-based allocation in UNICOS [17]) 
it seems preferable to avoid expanding blank common and to make more use of 
allocatable arrays. This is especially true in multitasked jobs, where special atten- 
tion must be paid to avoiding conflicts in addressing data structures in different 
tasks. The use of allocatable arrays within each task is clearly a much simpler ap- 
proach than trying to coordinate the expansion of blank common by more than 
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one task. It is to be hoped that this very powerful feature will be fully imple- 
mented soon, as it will provide an important part of the support for multitasking. 

In view of the sophistication required of a compiler that is to perform 
global optimization, vectorization, and some exploitation of multitasking, it is not 
entirely unexpected that the Cray compilers occasionally produce erroneous code. 
Much of this (especially under CFT2 on the CRAY-2) derives from attempts to 
optimize too large and complicated a code segment, and can be cured simply by 
decreasing the maximum size of a code block the compiler will try to optimize. 
Further, errors seldom cause a code to produce answers close to the correct values 
— they usually produce obviously incorrect results. Nonetheless, such errors are a 
considerable nuisance, as they can often be data-dependent, and will disappear in 
small, manageable test calculations, occurring only in large, expensive production 
runs. Unlike the situation with the UNICOS operating system, where there has 
been an overall improvement a,s new sj^stem versions have been released, more re- 
cent compilers have displayed more problems than older versions. This is partic- 
ularly true of the X— MP CFT compiler. No doubt the adoption of CFT77 as the 
sole FORTRAN compiler will allow more effort to be concentrated on improved 
and error-free generation of code to exploit the various levels of parallelism. 
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III. Performance 


A. Elementary computational chemistry kernels. 

A number of previous studies of Cray computer performance in the con- 
text of computational chemistry have appeared, and timing information on vari- 
ous types of constructs (primitive operations or “kernels”, as they are termed m 
the study by Saunders and Guest [18]) that occur widely in computational chem- 
istry codes have been given. In the present work we will use two typical kernels 
to illustrate particular performance aspects of different Cray computers. We will 
then briefly present performance figures of some other representative kernels. We 
should note here that the timing results we present are subject to some uncertain- 
ties, especially on the CRAY-2 where memory access delays from other user jobs 
can make observed times reproducible to no better than 10%. 

The first kernel we will consider is the addition of a multiple of one vec- 
tor to another, given in FORTRAN as 
DO 10 I = 1,N 

A(X) = A(I) + SCALAR*B (I) 

10 CONTINUE 

(referred to as SAXPY, using the BLAS name [19]). This is a rather typical sim- 
ple vector operation and its behavior will serve as a model for other loops. The 
second kernel is the more elaborate operation of matrix multiplication, repre- 
sented in FORTRAN (with a SAXPY inner loop) by the code 

DO 10 J “ 1 ,N 
DO 20 I = 1 ,N 

C(I , J) = A(1 , 1) *B(1 , J) 

20 CONTINUE 

DO 30 K = 2 ,N 
DO 40 I = 1,N 

C(I,J) = C(I , J) + A(I,K)*B(K, J) 

40 CONTINUE 

30 CONTINUE 

10 CONTINUE 

Matrix multiplication is a particularly efficient operation on all Cray computers 
and it therefore offers the greatest scope for enhancing program performance. In 
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practice, what is often desired is the more elaborate expression 

C = aC -j- /?AB, (Uhl) 

which forms the proposed Level 3 BLAS specification [20]. We will discuss later 
the handling of the more general forms and of sparseness in the matrices. 

For some simple kernels, it is possible to estimate the expected perfor- 
mance by considering instruction times. However, in most cases this is not very 
fruitful, as both compiled code and library code may take advantage of special 
features (or may be handicapped by special problems such as bank conflicts) and 
often produce quite different rates. We have chosen here to discuss only observed 
performance obtained on various machines (running a normal job mix, not in 
stand-alone mode) under different compilers. By analogy with the approach to 
hardware characterization of Hockney and Jesshope [21], we can evaluate from 
the performance data two quantities characteristic of each kernel: the maximum 
performance, Too, (a rate in MFLOPS) and the vector length at which half this 
performance is observed, denoted n\j 2 - (We should note that since most opera- 
tions do not yield monotonicalfy increasing performance rates, is not taken as 
the asymptotic rate, but rather as the highest observed rate for any vector length 
up to 1024.) In addition to the rather detailed discussions of SAXPY and matrix 
multiplication performance, we tabulate Too and n ly / 2 values for a number of other 
kernels concerned with both arithmetic and data motion. 

B. SAXPY and matrix multiplication. 

Table 1 shows the performance for the SAXPY operation obtained on 
a CRAY X-MP/48, using the CFT and CFT77 compilers, and also the SAXPY 
subroutine from CRTs scientific subroutine library, SCILIB [17,22]. It is evident 
that the best performance is obtained using the CFT77 compiler, avoiding the 
overhead associated with calling the library routine. CFT77 also displays the 
smallest njy 2 value. The better performance of the code obtained from CFT77 
relative to CFT is rather typical of the behavior of the two compilers (at least 
at the time of writing): CFT77 commonly produces code faster by 20 to 30% or 
more. All times in Table 1 show a steady growth in performance as the vector 
length increases, up to a vector length of 63. Somewhat unexpectedly, in view of 
the “natural” vector length of the machine, the performance for vector lengths 
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of 64 is not quite as high as for 63, although this may reflect the fact the code 
generated is designed to cope with vectors longer than 64. This behavior is not 
observed for other pairs of lengths such as 127 and 128, or 255 and 256. Finally, 
we note that the best rate observed for the SAXPY operation falls short of the ul- 
timate performance (210 MFLOPS for an X-MP with 9.5 ns clock), even though 
both an addition and multiplication are required in the innermost loop and these 
operations can be chained on the X-MP. 

A comparison of SAXPY performance on several different Cray comput- 
ers is given in Table 2. Only the approach that gives the highest r 0 0 is shown: 
this is the simple FORTRAN loop of the previous subsection compiled with 
CFT77 on the X-MP machines and the SCILIB version of the BLAS call on the 
CRAY-2 machines. This difference probably reflects the absence of hardware 
chaining on the CRAY-2, which could be partially compensated for by care- 
ful assembly language coding of a library routine but which is probably beyond 
the compiler’s capabilities. The best performance on the CRAY-2 is much less 
than the theoretical 488 MFLOPS, so even less of the machine’s power can be 
exploited this way than on the X-MP. Also, while the CRAY-2 performance im- 
proves steadily up to vector lengths of about 64 there are numerous fluctuations 
above this value that do not correlate with any obvious hardware characteristics. 
The timing tests were performed on production machines in multi-user mode, 
so these fluctuations may represent problems with bank conflicts engendered by 
other user jobs and by context switching. 

The X-MP/48 matrix multiplication results displayed in Table 3 provide 
a different perspective on performance. Here, the SCILIB routine MXM outper- 
forms all the FORTRAN implementations listed. The latter comprise the SAXPY 
inner loop form given explicitly in the previous subsection, a similar approach 
with a dot product inner loop, and the SAXPY inner loop form as above with 
the J loop unrolled four times. Although the unrolled loop structure gives quite 
good performance with larger array dimensions, it is still not competitive with the 
library MXM. The latter not only has an revalue of almost 200 MFLOPS, but 
obtains half this rate with a matrix dimension of only 11. The use of the library 
matrix multiplication routine would thus seem to be an ideal route to high perfor- 
mance on the CRAY X-MP, even for problems of rather small dimension. 

A comparison of matrix multiplication performance on several Cray com- 
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puters is given in Table 4. For all machines the highest rates are obtained with 
the library MXM routine. While the CRAY-2 performance still falls somewhat 
short of the maximum possible, experience has shown that this derives substan- 
tially from bank conflict problems created by competition with other user jobs. 

In stand-alone mode we have observed CRAY-2 MXM performance of more than 
420 MFLOPS. This is similar to the fraction of the theoretical performance ob- 
tained on the X-MP/48. Although n */2 f° r MXM on the CRAY-2 is larger than 
for the X-MP machines the values are still fairly small, so we can conclude that 
matrix multiplication is a route to high performance on the CRAY-2 as well as 
on the X-MP. 

It is quite straightforward to rationalize the different behavior of SAXPY 
and matrix multiplication. In the latter, considerably more arithmetic operations 
can be done on data once it has been read from memory into the vector regis- 
ters as compared to the former. There is therefore a much higher proportion of 
floating-point operations in the overall operation count in the matrix multipli- 
cation case and consequently higher performance. This observation is the key to 
programming Cray computers for high performance: to seek tasks which re-use 
data in the vector registers repeatedly, rather than using it once or twice. Ma- 
trix multiplication is probably the most obvious example of this approach, and we 
will generally try to show how it can be incorporated into user programs. Other 
related kernels, such as solving systems of linear equations, are also considered 
when we discuss particular computational chemistry tasks. In view of this central 
importance of matrix multiplication, we shall explore some aspects of it further 
here. 

Reference has already been made to more general matrix multiplication 
operations such as (III. 1 ). Although this form is not available in the SCILIB li- 
brary, not even in the simple form involving the constraints a = 1 (or 0) and 
j3 = ±1, several groups have prepared subroutines to perform this task. For the 
CRAY-2 the subroutine MXMPMA by Calahan et al. [23] handles the above form 
with the constraints a = 1 or 0 and /3 — ±1. This is a very efficient implemen- 
tation that exploits local memory on the CRAY-2. Another aspect of more gen- 
eral matrix multiplication is the exploitation of sparseness in the arrays A and B. 
Where sparseness has its origins in the symmetry of a problem, it is usually more 
advantageous to handle the symmetry explicitly, but it may often be useful to 
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take advantage of essentially random sparseness. Saunders’ routine MXMB [18] is 
widely used in this situation: its performance is similar to the CRI MXM routine 
for dense matrices so there is no loss of efficiency where there is little sparseness. 

Another generalization of the simple matrix product is the case in which 
either A or B is to be transposed before the multiplication. The SCILIB rou- 
tine MXMA allows for this possibility (and for other variations on the index- 
ing of the matrices, such as multiplication of sub-blocks of full arrays). The al- 
ternative approach would be to explicitly transpose the arrays as required. On 
the CRAY X-MP/48 it is our observation that there is little to choose between 
these strategies. On the CRAY-2 for most array dimensions there is also little 
difference, but there is a catastrophic loss of performance in MXMA when trans- 
posing arrays with dimensions that are multiples of 256 because of severe bank 
conflicts. In such cases the observed performance drops to some 18 MFLOPS, 
about 20 times slower than the best rate. Several smaller dimensions, such as 128 
or 64 also show a drop in performance, although not as great. While this can be 
overcome by embedding the desired matrix in a larger array to avoid the critical 
stride value, we normally prefer to explicitly transpose the matrix and use the 
routine MXM. Since, as noted, this strategy incurs virtually no penalty on the 
X-MP, we follow it on all Cray computers. This also increases the portability of 
programs, as it usually easier to find an analog of MXM on other machines than 
MXMA. 

The technique of obtaining increased performance on Cray computers by 
unrolling an outer loop is well documented [14], but it is perhaps less obvious that 
this approach can be extended to more than one loop. In the matrix multiplica- 
tion case this gives the following code fragment when two loops are unrolled to a 
depth of two: 

DO 10 J = 1,N,2 
DO 20 K = 1,N,2 
DO 40 I = 1,N 

C(I,J) = C(I,J) + A(I,K)*B(K,J) + A(I,K+1)*B(K+1,J) 

C(I,J+1) = C(I,J+1) + A ( I ,K)*B(K, J+l) + A(I,K+1)*B(K+1,J+1) 
30 CONTINUE 

20 CONTINUE 

10 CONTINUE 
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Unrolling the J loop enhances performance by reducing the number of fetch in- 
structions required for the elements of A, while unrolling the K loop reduces the 
number of fetch and store instructions for the elements of C. As A and C are dif- 
ferent arrays, these advantages are combined when both loops are unrolled, and 
the overall efficiency is increased. Formally, the efficiency increases as the un- 
rolling depth increases, but the finite number of vector registers, together with 
the limits on the size of code block the compilers can optimize imposes an upper 
limit to the useful depth of eight. CRAY-2 timings obtained with such a scheme 
(both loops unrolled eight times) are given in Table 5. Although the performance 
is considerably improved over the case of FORTRAN loops without unrolling, it 
is never as good as with MXM, so the latter would usually be preferred. An ex- 
ception might be the case of equation (III. 1 ) with a different from zero or one. 
The unrolled scheme can handle this case without the need to store a temporary 
product matrix separately, as would be required if MXM was used. 

For very large matrices there is some advantage in using specialized ma- 
trix multiplication algorithms that require fewer than 2n 3 floating-point opera- 
tions [24-26]. Many such algorithms use a recursive partitioning approach akin to 
Fast Fourier Transforms, in which the matrices are multipled by blocks using ex- 
pression rearrangement to eliminate some operations. The best of these schemes 
behaves as roughly ro 2 ' 5 [26], but a simpler scheme, due to Strassen [24], which 
behaves as about has recently been investigated on the CRAY-2 by Bai- 
ley [27]. We can compare this approach with the library routines and also with 
loop unrolling in FORTRAN. The results are displayed in Table 5. The routine 
MXMPMA is the assembly language program written by Calahan et al referred 
to above [23]. It is this routine that is used to perform the block matrix multipli- 
cations in the Strassen scheme. The performance figures for the latter are com- 
puted assuming 2 n 3 floating-point operations: for the large matrices (say, larger 
than 500x500) they are distinctly better than the other values. Note also that 
MXMPMA does not suffer from the bank conflicts that somewhat degrade the 
library MXM performance for the 256x256 and 512x512 cases. 

C. General performance. 

Observed performance, given as Too and ni/ 2 values, for a number of op- 
erations are given in Table 6. The SAXPY and matrix multiplication results of 
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the previous subsection are also reproduced for reference. For elementary vector 
operations like addition, the code produced by the CFT77 compiler out-performs 
the library subroutines on the X-MP and Y-MP, but the library versions run 
faster on the CRAY-2. Performance figures are also given for sparse vector oper- 
ations “SPDOT” and “SPAXPY” [22]. These allow for sparseness to be exploited 
in one vector: the SPAXPY for example is equivalent to the FORTRAN loop 
DO 10 I = 1,N 

AdNDEX(I)) - A(INDEX(I) ) + SCALAR*B(I) 

10 CONTINUE 

and is useful when vector B has been compressed down to non-zero values whose 
original addresses are stored in INDEX. SPDOT is simply the dot product equiva- 
lent. These kernels are useful in quantum chemical calculations: the performance 
of the library versions is quite good. 

Polynomial evaluation using Homer’s rule, such as 
DO 10 I = 1,N 

A(I) = ( (C3*X(I) + C2)*X(I) + C1)*X(I) + CO 
10 CONTINUE 

for the cubic case, is ver}^ efficiently vectorized by CFT77 on the X-MP and 
Y-MP, where for higher polynomial orders a substantial fraction (about 90%) 
of the maximum machine performance is obtained. This indicates that the com- 
piler is able to generate code to re-use data from the vector registers for this case. 
Polynomial evaluation is a less fruitful approach on the CRAY-2, where the con- 
vergence with increasing polynomial order is much slower. 

Element-by-element vector division is also shown in Table 6. Cray com- 
puters have a reciprocal approximation unit but no floating-point divide hard- 
ware, so division requires a reciprocal approximation, a multiplication to obtain 
an estimate of the quotient, followed by a Newton iteration to give the required 
accuracy in the quotient (see, e.g. Ref. 21). This step requires a subtraction and 
two more multiplications. Even if the subtraction is chained with one multipli- 
cation this process cannot run at more than 40 MFLOPS (strictly speaking at 
40 megaresults per second) on our X-MP/48, and the observed performance is 
noticeably less. However, if the division is part of a more elaborate computation, 
for which data can be held in the vector registers and re-used, the overall perfor- 
mance will be more satisfactory. The [2/1] rational fraction evaluation requires 



the same number of simple additions and multiplications as the cubic polynomial, 
and also requires division, with its contributing multiplications and subtractions, 
but the observed performance is still about two-thirds of the polynomial rate on 
the X-MP and Y-MP, and is equal to the polynomial rate on the CRAY-2. 

For many purposes it is necessary to evaluate special functions, and the 
performance of the FORTRAN library routines for several of these is shown in 
Table 6. The rates here are quoted in megaresults per second, and as these func- 
tions can require 20 or more floating-point operations for their evaluation the per- 
formance is seen to be rather good. It is interesting that this is one of the few 
areas in which the CRAY-2 outperforms the X-MP. On all machines the square 
root performance is substantially better than the other functions but is still not 
competitive with elementary vector operations. 

The last type of operation considered in Table 6 is data motion. On 
the CRAY-1 data motion was an expensive task, with a performance for a 
simple vector move of about 8 MW/s. With improvements such as hardware 
GATHER/SCATTER all of the current Cray computers perform much better 
than this, with vector move rates that are higher than operations like addition. 
Data motion on the CRAY-2 shows a somewhat worse performance ratio to the 
X-MP than that seen for elementary vector operations. 

D. Multitasking on Cray computers. 

As discussed in the Introduction, vectorization is not the only way to im- 
prove performance on those Cray computers with several CPUs: it is also possible 
to use several CPUs in parallel. The pros and cons of this approach, especially in 
the context of multi-user systems, have been discussed elsewhere [28]. Multitask- 
ing involves some overheads, so the total CPU time of a job will increase when it 
is multitasked [1,2]. Any timing improvement therefore comes (in a stand-alone 
machine) from a reduction in wall clock time. Except for the case of large user 
jobs that would be feasible only if run multitasked, where there is obviously no al- 
ternative, system throughput on a production machine is generally best served by 
minimizing user multitasking and maximizing multiprocessing (that is, “system 
multitasking”). There can be circumstances that demand a different approach, 
however. For example, we have previously discussed the “1/n rule” [28]: on a sys- 
tem with n CPUs there will be no risk of CPUs becoming idle because of lack of 



system resources provided users are allowed to acquire no more than 1/n of any 
resource. Clearly, if a user job requires , say, all of main memory the user should 
be encouraged to use all the CPUs, in order to keep the entire machine busy. 

This is thus another motivation for investigating multitasking. 

We concentrate here only on the performance obtained with multitasked 
codes. First, it should be noted that in any multiprogrammed system the oppor- 
tunity to use multiple CPUs will vary considerably with workload and the various 
tuning parameters in the job scheduler. As a consequence, the performance mea- 
sured for a multitasked job will commonty vary by a considerable amount from 
run to run, making it difficult to compare different approaches and different ma- 
chines. Some of the results presented here were therefore obtained during dedi- 
cated access to a given machine. Second, multitasking is not yet a commonplace 
activity on Cray computers, at least in our observation, and therefore some as- 
pects of multitasking have yet to be properly debugged. Finally, we should point 
out that, as a corollary of using dedicated time, some performance figures for the 
CRAY-2 are quite different from those given in the previous section. The reasons 
are discussed below. 

Table 7 contains performance figures for matrix multiplication using 
different numbers of CPUs on different machines. This is a macrotasked ma- 
trix multiplication in which the result matrix is divided into n column groups 
(for n tasks), as described in detail in Ref. 28. The X-MP and Y-MP display 
an essentially linear scaling with the number of CPUs employed, producing the 
impressive figure of more than 2.3 GFLOPS when eight CPUs are used on the 
Y-MP. The overheads associated with generating the tasks and acquiring CPUs 
are not included in Table 7, but if the matrix multiplication is repeated several 
times within the job, as would probably be the case in a production code, the 
mean overhead becomes negligible anyway. The CRAY-2 figures scale consider- 
ably worse than n, even in a stand-alone environment. This has been discussed at 
length elsewhere: it stems from bank conflicts in one task generated by the other 
tasks. In a production environment the performance is even worse, as more bank 
conflicts arise from other user jobs and from the context switching and job swap- 
ping that occurs in multi-user mode. It is this interference between tasks that 
makes such a difference between stand-alone and multi-user mode times for even 
single-tasked jobs on the CRAY-2; it is noteworthy that our stand-alone multi- 



tasked matrix multiplication performance results on four CPUs are about four 
times the multi-user mode single-tasked results of section IIIB. 

There is little value in repeating the above figures for microtasked ver- 
sions of the matrix multiplication. These show essentially the same performance 
improvements as does macrotasking in a stand-alone environment, while the im- 
provement possible in a production machine is an even stronger function of the 
workload than is the case for macrotasked jobs, so there is no simple way to 
quantify the performance observed. Microtasking is designed to utilize idle CPUs, 
and under UNICOS this appears to be implemented by giving extra tasks forked 
a very low priority (a high “nice” value, in UNIX terminology [29]). This means 
that these forked tasks will get a very small share (although not a zero share) of 
the CPUs, unless the number of user jobs in the system falls to such a level that 
only the forking task and its children remain, in which case all will utilize the 
CPUs. Obviously, in a dedicated environment a single microtasked job will uti- 
lize all CPUs fully and perform like its macrotasked equivalent, while in a normal 
production environment the forked tasks will accumulate almost no time and the 
result will be equivalent to the single- threaded version. On the other hand, an 
“idle CPU” under COS seems to correspond to a CPU not busy with a task with 
the same priority as the spa.wning task. Hence a. high-priority job under COS will 
generally be able to acquire CPUs as it desires, greatly improving its throughput 
(and degrading that of other user jobs), while a. similar job run at lower priority 
will see much less improvement (if any). COS microtasking can thus impact other 
users more than UNICOS microtasking, and for many COS production environ- 
ments microtasking may be discouraged. 

Overall, while multitasking can often improve performance, its use can be 
counter-productive in a multi-user environment. Under COS, at least, it appears 
desirable to have about eight jobs per CPU in the machine to ensure that no idle 
time accumulates, and at present (with Cray computers of up to four CPUs ca- 
pable of running COS) it should be possible in most production environments to 
keep a machine busy simply by multiprocessing user jobs. This might not be the 
case with 16 or even more CPUs, as the sophistication required in a job sched- 
uler to cope with keeping so many CPUs busy would be considerable. It may also 
be the case that for some environments in which very long jobs or jobs with very 
heavy resource requirements are run that it is not possible to schedule jobs to 


23 



keep all CPUs busy. Such situations are probably better suited to the use of mul- 
titasking, at least for some part of the time, A useful compromise for many users 
would be to have microtasked versions of various library routines (especially ma- 
trix multiplication) available, at least under UNICOS where there is less impact 
of microtasking on other users. In this way the multitasking would be transpar- 
ent to the user, but any idle CPUs could be utilized effectively. As noted above 
microtasked libraries are promised by CRI. 
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IV. Ab initio quantum chemistry 


A. General observations 

Many aspects of ab initio quantum chemistry program implementations 
for Cray computers have been discussed elsewhere [18,30]. We concentrate here 
on illustrating the general principles outlined above for efficient use of Cray com- 
puters, with examples from the MOLECULE [31,32] and SWEDEN [33] program 
systems. We discuss the evaluation and sorting of integrals, the optimization of 
SCF, MCSCF, and Cl wave functions, and also several less conventional meth- 
ods for different types of electronic structure calculations. As several authors have 
emphasized [34-36], the calculation of energy derivatives (gradients, Hessians etc) 
shares many features with the calculation of the energy itself, and for exigencies 
of space we shall only discuss the latter here. Most of the techniques described 
can readily be implemented in derivative calculations as well. 

There is no intention here to instruct the reader in the details of efficient 
programming; the “Optimization Guide” published by CRI for the X-MP se- 
ries [37] can be consulted for such material. We shall discuss some of the broader 
aspects of programming Cray computers, again under the assumption that the 
codes to be written must be fairly easily ported to other environments. 

B. Gaussian integrals and integral sorting. 

Several reviews of methods for evaluating Gaussian integrals, with spe- 
cial emphasis on vectorization, are available [38-40]: the comprehensive and lu- 
cid article by Saunders [38] is particularly recommended. The evaluation of one- 
electron integrals is straightforward and requires little time, and we deal only 
with two-electron integrals here. We shall concentrate here on special features 
of the MOLECULE integral program that are not covered by these reviews. 
MOLECULE evaluates integrals over symmetry- adapted linear combinations of 
contracted Gaussian functions; these contractions can be of either general [41] 
or segmented [42] type. Basis functions can be either Cartesian Gaussians or 
spherical harmonic functions — in the latter case it is possible to include not only 
“pure” spherical harmonics (Is, 2 p, 3d, 4/) but also the higher principal quantum 
number “contaminant” functions 3.s, 4p, etc. As the previous reviews of integral 
generation have only sketchily treated general contractions, symmetry adaptation 
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and transformation of Cartesians to a spherical harmonic basis (note that Saun- 
ders discusses in detail the calculation of integrals directly in a spherical harmonic 
basis), we shall treat these matters in some detail here. General contraction and 
transformation to spherical harmonics are both rather time-consuming, but the 
major part of the overall time is usually spent evaluating primitive integrals. 

The algorithm actually used for evaluation of primitive two-electron inte- 
grals in MOLECULE is based on the traditional factorization [43] into a product 
of terms each involving one Cartesian direction, and an “auxiliary function” term 
that must be evaluated by numerical approximation, symbolically 

I = J2x m Y m Z m F m (t ) (IV.l) 

m 

where m runs from zero to a limit given by the sum of angular quantum num- 
bers involved; F m (t ) is the “auxiliary function”, t depends on the relative distance 
between the various basis functions and their exponents. The efficient implemen- 
tation of this approach requires that all functions from a given shell (that is, that 
differ only in azimuthal quantum numbers) be treated together. The factors X m 
can be obtained by recursion, as shown originally by Boys [44] and as exploited 
in various forms by McMurchie and Davidson [45] and, recently, by Obara and 
Saika [46]. The recursion scheme used in MOLECULE has many features in com- 
mon with the latter approach, although it was originally derived some years ago 
to help implement efficiently the integral formulas given by Huzinaga and co- 
workers [43]. Like most integral algorithms, this lends itself to “extrinsic vec- 
torization” [18], that is, a suitable number of quadruplets of function exponents 
(in effect, a list of t values) are treated together, so the various manipulations in- 
volved in (IV.l) are performed for vectors of the desired length. This approach 
to vectorization is simplest if the terms grouped together are on the same set of 
four centers, so that when there are only one or two functions of a given type on 
each of the centers the vector lengths would be very short. The use of large prim- 
itive sets is becoming more common, however, and this produces satisfactory vec- 
tor lengths in most cases. The values in the vector of t values vary over a wide 
range in most calculations, and there is considerable sparseness to be exploited in 
the use of (IV.l). It is convenient to compress the vector to only non-zero values, 
compute the auxiliary function values for the compressed list and then expand 
the result list back for use in (IV.l). This can be done in practice without requir- 
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ing an index vector of the same length as the vector of t values because the latest 
version of X-MP CFT can vectorize the loop: 

J = 0 

DO 10 I - 1 , N 

IF (A (I) .GT. THR) THEN 
J = J + 1 
B(J) = A(I) 

ENDIF 
10 CONTINUE 

Since the vector of t values can number upwards of 50 000 terms for a large basis 
set, the elimination of the index vector can be very valuable on the X-MP . The 
EXPAND operation that is the reverse of this compression is also vectorized by 
X-MP CFT. 

Given then a set of primitive integrals, how can these be contracted effi- 
ciently using general contractions? This is, in effect, a four-index transformation 

= X X X Xfa l")r pj r M -r r *T.,, (iv.2) 

p q r s 

but in cases where, say, six to ten primitives are contracted to two or three con- 
tracted functions, the vector lengths in (IV.2) are simply too short for effective 
performance when formulated in the conventional matrix multiplication scheme 
(discussed in section IVD below). It is preferable to use a different approach for 
each of the four partial sums in (IV.2): if the (pq\rs) are arranged in a rectan- 
gular matrix I with column index p and a “compound” row index qrs , the first 
partial sum for (IV.2) can be written 

^ = ( IV - 3 ) 

v 

For N primitive functions and n contracted functions this is, in effect, a matrix 
multiplication of an N 3 x N matrix by an N x n matrix [47], where the more 
conventional scheme would have rs fixed and would thus involve only N X N ma- 
trices. As the Cray library matrix multiplication routines are organized to per- 
form the work in the order that gives maximum vector lengths, the scheme (IV. 3) 
will be vectorized with an inner loop of length N 3 . The next partial sum requires 
a “transposition” of the result array I / to give indices q and the compound rsi , 
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but this can be done straightforwardly either by an explicit transposition loop 
or, implicitly, by using the Cray library routine MXMA (see section IIIB above) 
to multiply matrices with non-unit stride between elements. For the CRAY-2, 
as discussed above, the explicit transposition has the virtue of reducing the per- 
formance losses arising from bank conflicts: no extra memory is required for the 
transposition as the transposed V can overwrite the original I. The simplest solu- 
tion is thus to use MXM throughout. 

Once the contraction has been performed for all sets of angular quantum 
numbers in a batch, it is possible (if desired) to transform to a basis of spherical 
harmonic functions. There are several advantages to such a transformation. First, 
and perhaps most importantly, if only the pure spherical harmonics are used the 
dimension of the basis (and therefore the length of the integral file) is reduced. 
Even the decrease in d shells from six components to five can bring about a useful 
reduction in the length of an integral file, while for a diatomic molecule the elim- 
ination of the contaminants from a [5s 4p 3d 2/ 1 g] basis will reduce the length 
of the integral file by considerably more than half. Again, the size of the matri- 
ces that are involved in this process is rather small, especially for the commonest 
higher angular momentum cases (d and / shells), so that it is advantageous to 
use the same scheme as is used for the contraction transformation. However, the 
arrays of transformation coefficients from Cartesians to spherical harmonics are 
rather sparse, especially for the lower angular quantum numbers, so it is useful 
here to have a matrix multiplication scheme that can exploit sparseness. Never- 
theless, even with the Cray library matrix multiplication routines the work asso- 
ciated with the transformation to spherical harmonics is a rather small fraction of 
the total integral time [32]. Incidentally, it may be useful to retain the contam- 
inants (and perhaps even eliminate the high angular spherical harmonics) if de- 
sired, as methods that need continuum-like functions may require fewer functions 
of 8p-type to describe a p - wave than they would of the usual 2 p functions. 

While the transformation from contracted Gaussian functions to 
symmetry-adapted linear combinations (restricted in MOLECULE and in our dis- 
cussion here to D 2/1 and its subgroups) could be regarded as a four-index trans- 
formation, various formal simplifications make an alternative approach prefer- 
able. Most symmetry-adaptation procedures [48,49] can be viewed as implemen- 
tations of the method of double coset decompositions presented by Davidson [50]. 
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A symmetry- adapted integral is given as 

iPaqpV-iSs) = N )x J (9k)x 6 (9i)(9iP 9jQ\9k r 9i&), (IV.4) 

i j k l 

where cm... indexes irreducible representations, are operators from the point 
group Q, p... are atomic basis functions, and N is a constant involving both nor- 
malization of the symmetry orbitals and selection rules on symmetry species. 

This expression can be replaced [50] by the simpler form 

(Paq^r^ss) = N' EXE X I3 (9j)x 7 (9k)x 6 (9l)(p gjq\gi<r us)- (IV.5) 

J I< L 

Here N' now incorporates the rules which select the unique a given the three 
other irreducible representations. It should be noted first that the number of 
summations in (IV.5) is three, compared to four in (IV.4): (IV.5) clearly involves 
less work. Second, the range of J, K and L in (IV.5) is easily restricted so that 
only unique atomic integrals need be computed and employed in obtaining the 
symmetry-adapted integrals [48,50]. (IV.5) can be vectorized very simply: as a 
group of atomic integrals with the same angular properties and centers are mul- 
tiplied with the same factors in (IV.5), each term in the triple summation can 
be viewed as an element in a SAXPY operation. Thus the “symmetry transfor- 
mation” can be vectorized as a set of SAXPY operations with length n 4 (in the 
above notation). In fact, by forming the symmetry integrals for multiple quadru- 
plets of irreducible representations at the same time, the atomic integral values 
can be re-used several times once they have been read into registers, further im- 
proving the efficiency. 

It can be seen from the foregoing paragraphs that the various operations 
we have discussed will involve either considerable data motion, or the use of non- 
unit strides. Thus the contraction of primitive integrals requires the processing of 
quadruplets of primitives for each quadruplet of angular quantum numbers, while 
the transformation to spherical harmonics requires the processing of quadruplets 
of angular quantum numbers for a given quadruplet of contracted functions. Fi- 
nally, the generation of symmetry integrals requires the processing of different 
quadruplets of centers, for a given quadruplet of angular quantum numbers and 
contracted functions. Clearly, rather sophisticated index manipulations are re- 
quired. Finally, we have tacitly ignored the use of permutational symmetry: in 
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particular, when the pairs of primitives that form the sets of charge distributions 
are the same, numerous square blocks can be reduced to triangular arrays of the 
distinct elements. This also complicates the index manipulations. 

CRAY X-MP/48 timing results for some integral calculations on the 
molecule N 2 are given in Table 8. The basis set used is a (13s 8 p 6d 4/ 2 g) 
primitive set contracted to [5s 4 p 3 cl 2 / 1 g] using an atomic natural orbital gen- 
eral contraction [51]. The contracted set comprises 140 Cartesian Gaussians, or 
110 spherical harmonic basis functions when the contaminants are removed. We 
shall use this N 2 calculation for timing comparisons throughout this section. In 
Table 8 it can be seen that the transformation to spherical harmonics requires 
some 65 seconds on the X— MP/48, but the resulting integral file is less than one- 
third the length of the Cartesian case. For this N 2 calculation, the saving in time 
in all later steps as a result of using spherical harmonic functions totals less than 
35 seconds, so there is a net increase of CPU time required when spherical har- 
monics are used. The substantial reduction in external storage required makes the 
trade-off worthwhile in most circumstances, however. 

Table 8 also contains results for the N 2 system treated in lower than the 
maximum possible (in MOLECULE) D 2 i , symmetry. The C 2v group used here 
has the C 2 axis along the bond, so the two centers are treated as inequivalent. 

The C 9 group is a subgroup of this C 2v group. As the symmetry is lowered, the 
integral time increases, although only slightly on going from D 2 h to C 2v . In this 
case there is a compensation between a reduction in overhead (as there are no 
equivalent centers to be generated), and an increase in the number of integrals to 
be calculated. 

A comparison of the performance of the integral generation code on dif- 
ferent Cray computers is presented in Table 9. The integrals are generated in 
somewhat less time on the CRAY-2 than on the X-MP/48: this is rather unusual 
since for most steps in electronic structure calculations we usually observe the re- 
verse. The improved performance on the CRAY-2 may derive in part from the 
use of somewhat more memory — 5 MW on the CRAY-2 as opposed to 2 MW 
on the X-MP — which allows longer vectors to be used in the primitive integral 
calculation and reduces some overhead. The Y-MP performance for the integral 
generation is very good: the Y-MP version uses 3.5 MW of memory, so some 
extra improvement relative to the X-MP is expected beyond the shorter Y-MP 
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clock period. What is observed is actually more than a factor of two, considerably 
larger than is seen for other codes. No ready explanation for this behavior offers 
itself. 

In addition to the parallelism discussed so far, integral generation lends 
itself readily to coarse-grained multiprocessing [28]. As each batch of integrals is 
computed independently, and assuming the integrals can be produced in more or 
less arbitrary order (as discussed below they will generally need some reordering 
later anyway), it is possible to compute different batches in parallel. In fact, in- 
stead of spawning a new task for each batch it is simpler to divide the range of 
the four-fold loops over shells in the integral code and execute each subrange as 
a separate task. With such large granularity the overhead associated with macro- 
tasking becomes an insignificant fraction of the total time, and in a dedicated en- 
vironment with n CPUs throughput improvements of very close to n times are 
seen [28]. 

As will become clear in the succeeding subsections, it is highly desirable 
to have the integrals ordered on the integral file. This is seldom consistent with 
efficient integral generation, especially when symmetry is used both to reduce 
the number of distinct integrals and in obtaining final integrals over symmetry- 
adapted functions. Consequently, it is usually necessary to reorder the integral 
file. The reordering of the one-electron integrals is, of course, trivial and we will 
discuss only the two-electron case. The input/output aspects of this process, 
especially the use of direct access storage, have been comprehensively reviewed 
before [52], we will therefore concentrate here on vectorization and discuss in- 
put/output only where it has some impact on vectorization. 

The MOLECULE program generates four files of two-electron inte- 
grals, partitioned according to the symmetry block structure of the integrals. For 
D 2 h and its subgroups, there are four types of allowed quadruplets of symmetry 
species: aaaa, a/3a/3, aa/3/3 and a/3-yS, where a etc denote irreducible repre- 
sentations. Each of these types is written to a separate file, as each type can be 
sorted independently of the others to give the final ordered list. Each raw integral 
file comprises non-zero integrals together with labels containing the irreducible 
representations of the four basis functions and their offsets within symmetries. 

The first step on reading a buffer of integrals is to unpack these labels and iden- 
tify the position(s) in the final ordered list for the given integral. The label un- 
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packing, which consists of Boolean operations, and the computation of the final 
position, which consists of multiplications and additions, can all be vectorized 
over the number of elements in the buffer. The label packing in MOLECULE en- 
sures that the compound indices in the label are strictly non-increasing within a 
label, so that no conditional structures are necessary in vectorizing the label pro- 
cessing during the read of the raw integral file. Although the calculation of the 
position addresses involves (slow) integer multiplication and division, X— MP CFT 
and CFT77 can generate code for performing these operations in the floating- 
point units (the CFT compiler requires the directive FASTMD [37]). This im- 
proves the performance of the loops involved by up to a factor of 20, and of the 
program as a whole by more than a factor of two. If the number of ordered inte- 
grals of a particular type is small enough to fit in main memory (which will of- 
ten be the case for Cray computer memories and high symmetries) the final po- 
sition addresses simpfy represent SCATTER pointer elements into the ordered 
list, and the raw integrals can therefore be placed in the desired locations with a 
SCATTER operation, again with a length equal to the number of elements in the 
buffer. 

When the number of ordered integrals of a given type is too large to fit 
in main memory, a bin sort of the Yoshimine type [52] is emploj^ed. The final po- 
sition indices are used to identif} 7 the bins for ea.ch integral and its label. Here, 
because of the possibility that a bin may be full and require emptying before an 
integral/label pair can be written to it, we encounter a step in the re-ordering 
that cannot straightforwardly be vectorized. When these bins are read back, how- 
ever, the only processing required is to use the final position index, offset by the 
start of the subrange of integrals in a particular chain of bins, as a SCATTER 
pointer for the integrals. Hence this part of the work is again vectorizable with a 
length equal to the number of integrals in a bin. On Cray computers this length 
can usually be several thousand. In fact, since the SCATTER operation will move 
about 40 MW/s on the X-MP or 20 MW/s on the CRAY-2, it would appear to 
be necessary to obtain at least the same I/O transfer rates into memory to pre- 
vent the re-ordering from becoming I/O bound. With head contention and rota- 
tional delays, and disk performance on noil-striped mass storage systems, this is 
hardly possible if the direct access bin file is disk-resident. However, the trans- 
fer rate from an X— MP SSD into memory is around 156 MW /s per SSD channel. 
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(Although some configurations have two SSD channels, it is extremely unlikely 
in practice that a single user job will simultaneously have access to more than 
one channel.) Thus if the bin file is entirely SSD resident the re-ordering step 
will never become I/O bound. With current SSD sizes up to 512 MW and with 
gigaword (GW) sizes available in the near future this will normally be the case. 
SSDs are not available for CRAY-2 computers, but for a 256 MW CRAY— 2 it will 
very often be possible to re-order the integrals entirely within memory. Only for 
large basis sets (upwards of 200 basis functions) and low symmetry will it be nec- 
essary to use a bin sort and direct access disk, and so for most calculations the 
re-ordering will not be I/O bound on the CRAY-2. 

For high symmetries (which normally give multi- centered symmetry or- 
bitals) and small molecules the only sparseness in the integral list will come from 
symmetry. However, for lower symmetries and larger systems (that is for larger 
distances between basis functions) there may be increasing sparseness in the in- 
tegral list from the neglect of very small integrals in MOLECULE. It will often 
be useful to exploit such sparseness to reduce the length of the ordered integral 
list, as well as the raw integral files. On the other hand, if this is done simply by 
adding a label word to each integral the sparseness must be greater than a 50% 
reduction in order to see any reduction in the length of the list. Instead, the in- 
dex locating a non- zero integral in the full list can be packed into the lowest bits 
of the integral itself. As each buffer of ordered integrals fixes one pair index ij y 
it is only necessary to pack the kl pair index, for which it is convenient to use 
16 bits. This involves a loss of between four and five decimal places in the man- 
tissa, which is acceptable for most applications, given the 14 digit precision on 
Cray computers. The examination of the ordered integrals, prior to writing them 
out, to check whether their magnitude exceeds some threshold, can be viewed as 
building a GATHER pointer vector (strictly, a COMPRESS pointer): the subse- 
quent GATHER and the packing of the GATHER pointer elements into the low- 
est bits of the integrals can then be completely vectorized. The entire process can 
be viewed as a “compressed index” vector operation on the X-MP [11], and under 
CFT this can be vectorized by the compiler without any need to allocate space 
for the GATHER index vector. The subsequent processing of such “compressed” 
integrals is discussed below. 

Timing for ordering various integral files is given in Table 8 for N 2 calcu- 
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lations on the X-MP/48. When spherical harmonics are used in D 2 h symmetry, 
the ordering can be carried out entirely within 2 MW of memory. This is another 
useful trade-off against integral generation time when using spherical harmonics: 
the ordering takes five times longer overall when Cartesians are used, and unless 
about 3.5 MW of memory is available requires either direct access disk or SSD 
space. It should be noted that a request for this memory would incur a prior- 
ity penalty on our X-MP/48, according to the “1/n” rule (see section HID and 
Ref. 28). When ordering in memory up to 1 000 000 ordered integrals can be pro- 
duced per second, while for out-of-memory cases the observed best rate depends 
on the symmetry. In high symmetry cases about 300 000 ordered integrals can 
be produced in one second, but for the low symmetries this rate rises to more 
than 500 000 per second. Table 9 contains results for integral ordering obtained 
on various Cray computers. All of the results given are for sorting in memory. 

The Y-MP results are somewhat faster than would be expected from the ratio of 
Y-MP and X-MP clock periods, although the speed increase is not as great as 
was observed for the integral calculation. The CRAY-2 is considerably slower in 
this step than the X-MP, because of its much slower memory. 

As we shall see, for some applications it is desirable to have a different 
combination of integrals, such as the “P-supermatrix” with elements 

V(ij\kl) = (ij\kl) - l/4[(ifc|i0 + (il\jk)}. (IV.6) 

For later convenience the “diagonal” elements ij = hi are usually halved. The 
reprocessing of the ordered integrals to obtain these values can be handled us- 
ing very similar techniques to those described above. In fact, as fewer symmetry 
blocks of P are usually required than of the integrals themselves, the processing is 
even simpler. 

C. SCF calculations. 

The most time-consuming part of most SCF calculations is the contrac- 
tion of integrals with density matrix elements to build one or more Fock matrices. 
For the closed-shell Fock operator F this process can be written as 

= \kt)D kl , (IV.7) 

k l 
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where D is the density matrix and V is the supermatrix introduced in the previ- 
ous section [53]. Operationally, however, this step is driven by reading blocks of V 
from file: blocks corresponding to kl values for fixed ij such that kl < ij give rise 
to two contributions 

DO 10 KL = 1 ,NKL 

F (KL) = F (KL) + D(l J)*P(KL) 

10 CONTINUE 

DO 20 KL = 1 ,NKL 

F(IJ) = F (I J) + D(KL) *P (KL) 

20 CONTINUE 

The first of these loops is clearly a SAXPY operation and the second is a dot 
product. Hence these steps are immediately vectorizable. Further, if the vector 
P has been compressed to onfy non-zero values these two steps can be viewed as a 
sparse SAXPY and a sparse dot product, and are again vectorizable as described 
in section IIIC above. A more complete discussion of this aspect can be found in 
Refs. 28 and 54. 

For open-shell energy expressions [53,55,56], additional Fock operators 
must be built from other combinations of integrals, such as the K supermatrix, 

K{ij\kl) = l/2[(*fc|j'0 + (il\jk )} , (IV.8) 

and open-shell density matrices. Such work is obviously vectorizable in complete 
analogy with the V processing described above. If multiple open-shell densities 
are employed, the vectors of K values can be re-used from vector registers, enhanc- 
ing performance. Alternatively, it is possible to read the V and 1C supermatrices 
in separate tasks, thereby utilizing more than one CPU simultaneously. It is also 
possible to use multiple CPUs to process different parts of the same supermatrix, 
as discussed in Ref. 28. However, if no sparseness is employed the loops for con- 
structing F run at 60 to 100 MFLOPS on the X-MP, and therefore (running the 
two loops sequentially) require reading up to 25 MW/ s from file to avoid becom- 
ing I/O bound. Any tactic to increase the effective rate of CPU operations, such 
as using multiple CPUs, obviously only increases the demands on the I/O system. 
One possible solution, as discussed previously, is to keep the supermatrix files on 
the SSD, another, discussed in more detail below, is to keep the files in memory 
on the CRAY-2. 
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The only other step in an SCF calculation that consumes anything but 
trivial amounts of time is the Fock matrix diagonalization, and for most appli- 
cations with up to 200 basis functions and symmetry blocking this step is quite 
inexpensive. Although specially optimized matrix diagonalization routines (such 
as the EISPACK library [57]) are available in CRTs SCILIB library, the require- 
ments for SCF calculations are usually so modest that simple Jacobi or Givens 
schemes are adequate. 

The fraction of the SCF time required for Fock matrix construction is il- 
lustrated by the N 2 results of Tables 8 and 9. For the high symmetry calculations 
50% or more of the SCF iteration time is used in this step. For the low symme- 
try calculations the diagonalization of the Fock matrix is relatively more time- 
consuming. For the D 211 case the Y-MP results of Table 9 are essentially what 
would be expected from its clock period, while the CRAY-2 results are rather 
poor, given that the sparse dot product and SAXPY performance is expected 
to be similar to that of the X-MP (see section IIIC). However, the X-MP and 
Y-MP versions of the SCF code read V in large fixed-length blocks, so part of the 
difference between these machines and the CRAY-2 may result from I/O over- 
heads on the latter. 

D. Integral Transformation 

The treatment of electron correlation requires that the integrals be trans- 
formed from the atomic orbital basis set to the molecular orbital basis set. The 
computational aspects of this step have been extensively reviewed by Wilson [58]. 
Transformation of integrals can be implicit in some approaches, such as the 
method of self-consistent electron pairs [59,60], or explicit, involving reading the 
file containing the AO integrals, transforming these to the MO basis, and writing 
a new file containing the MO integrals. Hybrid methods for treating the correla- 
tion problem are also possible, where some contributions are computed in the MO 
basis set and some directly from the AO integrals, thus requiring only a partial 
transformation. 

Without symmetry, the full transformation is of the order mn 4 , where 
n is the number of basis functions in the AO basis set and m is the number of 
MOs. First, it is obviously beneficial to minimize m, and the simplest step in 
this direction is to eliminate transforming to any orbitals which are not occu- 
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pied in the correlation treatment. Furthermore, the contribution to the energy 
arising from electrons which are not correlated (the core) is straightforwardly ex- 
pressed in terms of Coulomb and exchange integrals only (see, e.g. Ref. 61). This 
is commonly accounted for by constructing a core Fock operator to be added to 
the one-electron integrals. In our implementation, this core Fock operator can be 
constructed using the supermatrix formulation described above for the SCF pro- 
cedure, or directly from the ordered integrals without further sorting. Clearly, 
constructing the Fock operator from the supermatrix is faster provided the su- 
permatrix already exists. However, if the supermatrix is not available, the work 
required to construct one Fock operator directly from the ordered integrals is 
roughly equal to the time to form the supermatrix — this cost can be amortized 
over the iterations of an SCF calculation as the supermatrix is used a number of 
times, but for the core Fock operator construction we prefer to avoid the extra 
I/O of the sorting step and to construct the operator from the ordered integrals. 

In addition to the savings from avoiding transforming for core or deleted 
virtual orbitals, the overall work can be reduced by exploiting sparseness in the 
integral or coefficient arrays. While some reduction can be effected simply us- 
ing sparse vector or matrix arithmetic, in cases where the sparseness derives from 
the symmetry of the system it is preferable to handle the symmetry explicitly. 

The use of symmetry in the transformation can reduce the overall work by or- 
ders of magnitude: if a basis set of n functions is symmetry adapted with n/2 ba- 
sis functions in each of two irreducible representations, the work is reduced from 
one transformation of order n 5 to four transformations of n 5 / 32, or an eightfold 
reduction. Clearly, for systems with higher symmetry, such as D 2 / 1 , the savings 
would be even larger. It should be noted that the operation counts given here are 
based on the assumption that the two-electron integrals are available in a basis of 
symmetry orbitals. While there are schemes for obtaining the advantages of high 
symmetry in the SCF step without the formation of symmetry integrals [62-64], 
such schemes become very complicated for the transformation. However, as de- 
scribed above the symmetry transformation in the calculation of the integrals is 
very efficient, and the very large savings in the transformation (and SCF) steps 
that derive from it are real, not a consequence of moving work from the transfor- 
mation into the integral evaluation. Of course, the simple and efficient symmetry- 
adaptation procedure described in section IVB is restricted to D 2 h and its sub- 
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groups (although see Ref. 50), but very few treatments of electron correlation are 
implemented in higher symmetries anyway. 

We now discuss our implementation of the transformation, starting with 
the one-electron integrals. While this requires very little of the overall time, it 
illustrates several aspects of the transformation of the two-electron integrals. The 
transformation of the one-electron integrals, H, by the coefficient matrix, C, can 
be written as 

H' = C t HC. (IV. 9) 

but the one-electron integrals Hij are symmetric and therefore naturally stored as 
i > j. If the integrals are stored only as this lower triangle, it is difficult to vec- 
torize (IV. 9). However, if the integrals are expanded to a square matrix (a matrix 
T) the transformation (IV. 9) clearly becomes two matrix multiplications. The 
transformed integrals naturally form a square matrix, which can be compressed 
to a lower triangle before being written to disk. Based on the timings reported in 
section III above it is clear that a formulation in terms of matrix multiplications 
will be very efficient on Cray computers. Hence if the squaring of the AO inte- 
gral matrix or the compression of the transformed integrals is coded inefficiently, 
these processes could actually become the rate-limiting step. That is, the pre- 
and post-processing of the integrals, which are steps of order n 2 , could take longer 
than the n 3 matrix multiplication step, at least for any n that is likely to be en- 
countered in present quantum chemical calculations. Two efficient approaches of 
squaring (or of compressing) the matrix can be considered. In the first, a row is 
moved from lower triangular storage to a row and column in the square form. 

IX=0 

DO 10 I = 1,N 
DO 20 J = 1,1 

T(J,I)=H(J+IX) 

T(I , J)=H( J+IX) 

20 CONTINUE 

IX = IX + I 
10 CONTINUE 

In the second approach a vector of length n 2 is used to hold a mapping into each 
location in T from a location in H. Then by using a GATHER the matrix T can 
be easily formed. Similar procedures can be used for the compress. (Note that 
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many GATHER operations can be viewed alternately as SCATTER operations: 
we use the term GATHER to describe either coding organization). In this ap- 
proach there is obviously additional overhead in forming the GATHER pointers, 
but in the transformation of the two-electron integrals the same steps occur so of- 
ten that the cost of constructing the GATHER pointers once at the beginning of 
the calculation is effectively amortized to zero. 

Either approach to squaring a one-electron integral matrix works well on 
Cray computers. However, on a machine with a large overhead associated with 
starting vector operations, like the CDC CYBER 205 or some of the Japanese 
supercomputers, the GATHER implementation is to be preferred. We have there- 
fore implemented the GATHER approach since it gives us some additional porta- 
bility. 

For large basis sets the two-electron integral transformation poses some 
problems in data handling. Yoshimine [65] has described an efficient process for 
transforming two-electron integrals (pq\rs) into (ij\kl) that requires only a small 
fraction of either ( pq\rs ) or (ij\kl) to be in core at one time, that is, an out-of- 
core transformation. In successive steps a list of (pq\rs) is converted into a list 
of (zj|rs), which is then transposed to a list of (rs\ij) and transformed to a list 
of (kl\ij). If the ordered integrals are stored such that for each r > s all p > q 
values are present (this requires some double storage in the aaa<y and ocfiafi sym- 
metry blocks) it is possible to read in the “rs row” of the integrals, square the 
pq indices and perform a two-index transformation on this subset of the integrals 
analogous to the one-electron integrals. This set of transformed integrals (rs|zj) 
comprises all ij for fixed rs , and can be compressed to distinct integrals (z > j) 
only before being written to disk. We also compress the final transformed inte- 
grals before writing them to file such that only the unique values are stored. That 
is, any double storage in the aaaa and afiaft blocks is removed. 

Some additional considerations that were not encountered in the one- 
electron transformation affect the coding of the two-electron case. For example, 
the number of two-electron integrals can be very large and for extended systems 
and low symmetries there may be many accidental zeros. As noted in section IVA 
we have the option to store only those symmetry orbital integrals above a given 
threshold: each rs row of integrals is compressed to an array of non-zero inte- 
grals with the position index of each integral in the uncompressed row packed 
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into its low-order 16 bits, these are written out with the number of non-zero pq 
values. To expand these values to the full row we clear an array the length of 
the full row, mask off the low-order bits into an integer array, and use this inte- 
ger array as the pointer vector in a SCATTER operation. All of these steps are 
vectorizable, so the process is very efficient. Another aspect of the transforma- 
tion concerns the appearance of the transposed coefficient matrix C T in (IV. 9) 
above. Based on the observations presented in section IIIC, it is better to trans- 
pose a matrix explicitly and use the SCILIB routine MXM than to use MXMA 
to multiply by the transpose, as this minimizes problems with bank conflicts on 
the CRAY-2. Of course, in the two-electron transformation the same C and C T 
matrices are used repeatedly, so there is no loss in efficiency (and only a trivial in- 
crease in storage) if we store both C and C T and use MXM. Where the integral 
list is sparse because of accidental zeroes it would be advantageous to use a sparse 
matrix multiplication routine like Saunders’ MXMB [18], but at present we use 
only the Cray library routines. 

While formally the two-electron transformation process can be considered 
as two series of two-index transformations, another complication relative to the 
one-electron case is that a transposition of the results is required before the sec- 
ond series of two-index transformations. In our implementation the transposition 
is merged into the two-index transformations. That is, as each two-index transfor- 
mation is completed the resulting integrals are moved into bins for the direct ac- 
cess I/O. It is important to note that this process is actually a transposition and 
not a sort, so there is no index calculation: the first w elements go into bin 1, the 
second w into bin 2 etc; the value of w is determined by the available memory. 
This step can be coded as a series of vector moves into the bins. The transposi- 
tion step is also merged with the second half-transformation: after the contents 
of a chain of bins is SCATTERed into memory, all related two-index transforma- 
tions are performed and the fully transformed integrals are written to disk. The 
process is then repeated for the next chain of half- transformed integral bins. 

Another special circumstance in the two-electron transformation arises 
for symmetry types a(3ot(3 and afijS, since there is no permutational symme- 
try between r and s or between p and q in integrals (pq\rs) of these symmetry 
types. There is therefore no need to square raw integrals or to compress trans- 
formed integrals in these cases. However, handling the different cases does not 
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give rise to any noticeable overhead, a,s we separate the various symmetry types 
at a very high level (see IVB above). Overall, the inclusion of symmetry involves 
rather straightforward coding and introduces little overhead, but provides enor- 
mous gains in performance. 

Two-electron integral transformation timing data for our N 2 example are 
given in Table 9. The CPU times are effectively negligible in comparison with 
other steps (notably the integral evaluation) on all machines. As ordered inte- 
grals are used as input, and only a transposition is required after the first half- 
transformed step, the wall clock times are also very small. The CPU times given 
correspond to a rate of around 50 MFLOPS on the X-MP and CRAY-2, and 
65 MFLOPS on the Y-MP. There is actually rather substantial overhead in these 
times: if a transformation is performed for the C\ symmetry N 2 calculation de- 
scribed in section IVB the X— MP/48 CPU time required is 324 seconds, but the 
rate increases to 150 MFLOPS. Thus the use of D 211 symmetry in this case gains 
a factor of 38 in CPU time, but there is a concomitant threefold loss in perfor- 
mance because of the overheads involved in processing numerous rather small 
symmetry blocks. For a very large basis, in which the transformation overheads 
would be amortized to nearly zero, the use of D 211 symmetry would give an im- 
provement of more than two orders of magnitude over the no symmetry case. 

As expected, the out-of-core transformation described here works 
very well on Cray computers. However, on some other computers, such as the 
CDC CYBER 205, the large vector start-up overhead leads to poor perfor- 
mance for even the larger matrices. Therefore we have also programmed an in- 
core transformation. We use a modification of the Bender approach [66]. As in 
the out-of-core scheme we separate those cases with and without rs permuta- 
tional symmetry. The first half-transformation is replaced by two matrix mul- 
tiplications. By ignoring the permutational symmetry between pq and rs, (that 
is, the aaaa case is treated as aa/3(3 and the a (3 a j3 case as n/?7 <5), the second 
half-transformation can be written as long SAXPY operations. The redundant 
integrals for the aaaa and a (3 a f3 cases are eliminated using a precomputed 
GATHER pointer vector. This organization yields excellent performance on the 
CYBER 205. As this approach eliminates the transposition and I/O, it also yields 
essentially the same performance as the out-of-core approach on all Cray comput- 
ers, for those calculations for which the transformed integrals can be held in core. 
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By having both an in- and out-of-core transformation we thus have some 
additional portability without sacrificing any performance. The program de- 
scribed is so highly vectorized that the n 5 transformation step commonly repre- 
sents an insignificant fraction of the total time. This is very different from the 
situation on scalar machines. 


E. Full Configuration Interaction 

The full Cl (FCI) procedure, the use of all configurations that can be 
constructed from a given one-particle basis set, is an exact solution to the corre- 
lation problem for this basis set, and therefore stands as an unambiguous test of 
approximate methods. Due to recent advances in full Cl methodology (principally 
in vectorization) and developments in computer hardware it has been possible 
to perform an important series of benchmark calculations [67,68]. Of course, it 
is still not possible to perform FCI calculations for large basis sets and problems 
of chemical interest, but it is nevertheless instructive to begin our discussion of 
correlated wave function generation with the FCI method. Its simplicity makes 
it ideal for explaining some of the concepts of vectorizing other approaches, and 
FCI wave functions in a limited one-particle space lie at the heart of the CASSCF 
method. We therefore consider the FCI approach first and then proceed to the 
other methods. 


In most modern Cl calculations, with any type of configuration space, 
the Hamiltonian matrix is too large to be held in memory, and therefore an iter- 
ative diagonalization procedure is required to obtain the Cl energy eigenvalue(s) 
and eigenvector(s). This is true even on the CRAY-2 when configuration spaces 
of order 10 6 and more are to be handled. The Davidson diagonalization proce- 
dure [69,70] is most commonly used. This method has several advantages, one 
of which is that it imposes no constraints on the order in which matrix elements 
are processed. This allows the matrix elements to be computed in an order that 
achieves the maximum overall performance, whereas techniques that implicitly 
require access to, say, one row of the matrix at a time might lead to intolerable 
degradation of performance. The key is thus to form efficiently the residual vector 
cr, the product of the current estimate of the Cl eigenvector c and the Hamilto- 
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nia n matrix H. This step can be written as 

OK = £ £ £ £ £(p<7|rs)£*,r„ c L> (IV.10) 

p q r s L 

where B are the “coupling coefficients”. Siegbahn pointed out [71] that if the 
known factorization of B^, s into products of one-electron coupling coefficients 
(using a resolution of the identity), 

^ = £</Vh (iv.ii) 

J 

is explicitly inserted into (IV. 10), it is possible to vectorize the calculation of o 
very straightforwardly. First, the product of one set of coupling coefficients and c 
is collected in D, 

Dr S ~ A rs c L> (IV. 12) 

L 

This can be implemented as a set of SAXPY operations involving the elements 
of the very sparse matrix A. A matrix multiplication of the intermediate array D 
and a block of the integrals is performed, 

E pq = (IV. 13) 

rs 

The intermediate array E is then contracted with the second set of coupling coef- 
ficients to form a contribution to cr, 

^ = ££</V ? - ( IV - 14 ) 

J pq 

This operation is a matrix-vector product. It is clear that all steps in the calcu- 
lation of <7 can be vectorized, given that all the necessary quantities are available 
as needed. This is easily arranged for a, c and the integrals (the latter can be 
stored in memory for any MO space it is feasible to use in a full Cl calculation). 
The handling of the coupling coefficients A requires more attention. As there 
is a rather large number of A p q L elements these must either be pre-computed 
and stored on disk (sorted to an order which allows vectorization of (IV. 12) and 
(IV. 14)) or computed on the fly as required in these two equations. 
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Siegbahn originally introduced this method [71] to improve the perfor- 
mance and increase the size of the FCI step in CASSCF calculations. In this 
context it was considered appropriate to use spin eigenfunctions as the configu- 
ration basis and to compute and store a list of the coupling coefficients on disk. 
This would require a great deal of disk space for large calculations, but is accept- 
able for calculations with up to, say, 12 electrons and 12 orbitals in the FCI. For 
larger calculations, disk space would become excessive, even though the CPU time 
requirements would remain acceptable, so that disk space becomes the limiting 
factor. Knowles and Handy [72] showed that by using determinants instead of 
spin eigenfunctions as the configuration basis the non-zero values of A^ q L (all of 
which then become ±1) can be computed on the fly. The Cl vector c is several 
times longer in a determinant al basis, but this is no handicap on computers such 
as the CRAY-2. In this way FCI calculations can be performed using very large 
expansions, providing a means of benchmarking approximate methods as well as 
performing large CASSCF calculations. The trade-off of memory (and, perhaps, 
CPU time) for disk storage is, of course, not an uncommon feature of program- 
ming modern supercomputers: we discuss below other trade-offs that would not 
have been considered a few years ago when CPU power was the limiting factor. 

As the most time-consuming step in the FCI algorithm (IV. 12-14) is 
a matrix multiplication, the code is very efficient on Cray computers. This has 
allowed calculations as large 28 000 000 determinants to be performed on the 
CRAY-2 [73]. Such a calculation requires on the order of 100 minutes of CRAY-2 
CPU time per FCI iteration: something over 90% of this time is spent in matrix 
multiplication. As c, a and some scratch arrays must be held in core, the mem- 
ory required for such a calculation is about 60 MW, which is perfectly feasible on 
the CRAY-2. However, the Davidson diagonalization process requires the c and a 
vectors from the previous iterations. Therefore, just as storing the coupling coeffi- 
cients can exhaust the disk storage long before the CPU time become prohibitive, 
so can the need to store the previous c and a vectors. As Davidson noted [69], it 
is possible to “fold” all the previous vectors into one vector, effectively starting 
again with the current estimate of the eigenvector as the starting guess. While 
this can slow convergence somewhat, it can reduce considerably the disk storage 
required. In effect, the CPU performance and the high degree of vectorization, 
compared to the disk performance and space limits, mandate implementing the 
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folding procedure if the largest possible FCI expansions are to be considered. 

In addition to the larger dimension of the FCI problem using a deter- 
minantal basis, there is another difficulty: the potential collapse of a desired 
higher root of the secular problem to a lower root of a different spin symmetry 
due to numerical rounding [74]. This can be a severe problem in the application 
to CASSCF wave functions, where many roots of the secular problem may be re- 
quired in investigating problems related to spectroscopy. Recently, Malmqvist et 
al. [75] have developed a method for using spin eigenfunctions of the desired sym- 
metry (configuration state functions, CSFs) instead of determinants, but without 
the need to store the coupling coefficients on disk. In their approach the graphical 
unitary group representation of the CSFs is used (see, e.g. Ref. 76). The graph 
that defines the configuration space is split into an upper and lower portion, and 
at each graph node on the dividing line contributions to the product (IV. 12) are 
computed and accumulated using matrix multiplication. This requires subsets 
of coupling coefficient A^ q L for each portion of the graph, plus some “partial co- 
efficients” for cross-terms between the two portions. All of this coupling coef- 
ficient information can be held in memory, even for large configuration spaces 
(100 000 CSFs and more). Such an approach seems ideal for the CASSCF prob- 
lem where the number of active orbitals is limited, but may not be suitable for 
the large calculations used for benchmarking. It is clear that this is currently an 
active area of research, and as yet the best method of performing FCI calculations 
may not have been achieved. However, even the current approaches illustrate the 
very high level of vectorization that can be achieved in Cl methods for solving 
the correlation problem. They also illustrate the need for careful thought about 
trade-offs in disk storage, memory use and CPU time in order to maximize the 
size of problem that can be solved. 

F. Symbolic formula tape for multireference Cl calculations. 

The most general type of large-scale Cl wave functions in current use is 
that comprising a set of reference CSFs and all CSFs singly and doubly excited 
with respect to these reference CSFs. The first task in generating such a wave 
function is to evaluate the coupling coefficients involved in the various Hamil- 
tonian matrix elements. We begin by classifying the MOs into inactive, active, 
and secondary orbitals. The inactive orbitals are doubly occupied in all reference 
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CSFs (but can have other occupations in the Cl configuration space — orbitals 
which are always doubly occupied are absorbed into the effective one-particle 
Hamiltonian as described in section IVD). The active orbitals have different oc- 
cupations in different reference CSFs and the secondary orbitals are unoccupied 
in the reference CSFs. Classes of configurations can be defined by their internal 
occupation , or internal for short. An internal occupation consists of a particular 
sequence of active and inactive orbitals containing n, n ~~ 1 or n — 2 electrons al- 
together. Those internal occupations that contain n electrons give rise to valence 
configurations simply by enumerating the possible spin-couplings with which they 
can appear; valence configurations must have the same spatial symmetry as the 
desired Cl root and must differ from a reference occupation by no more than a 
double excitation. Internals with n — 1 or n — 2 electrons are incomplete, in the 
sense that to obtain CSFs it is necessary to add one or two secondary orbitals to 
the occupation. Evidently, a given n — 2 internal, say, can generate a set of CSFs 
by coupling in different pairs of secondary orbitals with different spin-couplings. 
Again, the resulting n — electron occupations are constrained to differ from the 
reference occupations by no more than a double excitation. By expressing a Cl 
configuration list in this manner it becomes clear that the same coupling coeffi- 
cient applies to a large set of matrix elements; the coupling coefficients are deter- 
mined essentially by the internals from which the CSFs are derived, so that the 
particular secondary orbitals that appear are irrelevant. Therefore, if the coupling 
coefficients are computed for all possible internal-internal interactions, all of the 
required Cl matrix elements can be computed using the appropriate integrals and 
these coupling coefficients. 

In the first direct Cl programs the unique coupling coefficients were com- 
puted by hand and coded into the program [77]. The types of wave function for 
which this is feasible are very restricted, and programs following this approach 
were available only for single reference configuration wave functions based on a 
closed-shell determinant or a UHF determinant. The direct Cl approach became 
much more general once Siegbahn [78,79] combined the factorization of CSFs 
(and of coupling coefficients) into internal and external parts with the graphical 
unitary group approach as formulated by Shavitt [80]. Several variations on this 
idea were implemented, but most suffered from some limitations in the calculation 
of the coupling coefficients, a step formulated as essentially a set of scalar opera- 
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tions. Once the coupling coefficients were evaluated, on the other hand, the calcu- 
lation of the eigenvalues was very efficient, just as in the FCI case, and we discuss 
this aspect in more detail in the next sub-section. The problem in the evaluation 
of the coupling coefficients was most severe in the case that the number of refer- 
ences actually used was a small subset of those possible for a given choice of ac- 
tive and inactive spaces, which is unfortunately a rather common case. One alter- 
native approach would be to avoid the unitary group and use a more conventional 
Cl approach. The internal occupations are generated, together with all possible 
spin-couplings to which they give rise (including coupling of one or two electrons 
in model external orbitals if required). The occupations are compared to iden- 
tify potentially non-zero matrix elements, and the coupling coefficients evaluated 
between prototype CSFs represented by the individual spin-couplings. Such an 
approach has been implemented [81], but although it can avoid some of the prob- 
lems associated with the unitary group approach, the calculation of the coupling 
coefficients is still not vectorizable. 

Recent work along quite different lines by Knowles and Werner [82] and 
by Siegbahn [83] has enormously simplified the calculation of the coupling coef- 
ficients. This new approach again exploits the factorization of the two-electron 
coupling coefficients into sums of products of one-electron coupling coefficients, 
as in (IV. 11). First, the internal occupations are generated. In computing the 
one-electron coupling coefficients a product form is used: a fictitious MO a is in- 
troduced which is unoccupied in all the internals, whereupon we can write 

M 

where M is an “internal” to which a has been coupled. The A^ a M values are very 
simple to calculate and the values can be held in memory. In evaluating the two- 
electron coupling coefficients, pairs of internals are compared and the type of cou- 
pling coefficients that arise are identified (these types are determined by the or- 
bital differences between the pair and the open shells they contain). The coupling 
coefficients between all CSFs that can be derived from the pair of internals are 
then evaluated by matrix multiplication of the stored one-electron terms. The 
corresponding formula tape entry comprises the internal labels, the label of the 
integrals (or blocks of integrals) that appear, and the coupling coefficients. In 
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this approach there is little redundant work even where only a few of the pos- 
sible reference occupations arising from a given active space are used. Further- 
more, unlike the approach using prototype CSFs described above, the evaluation 
of the coupling coefficients is highly vectorized. The program that we use was 
developed by Siegbahn [83] and performance better than 50 MFLOPS has been 
observed on the X-MP/48. Some timing results are given in Table 10. For our 
N 2 example the coupling coefficients for single and double excitations from a sin- 
gle closed-shell reference CSF (16 internal occupations) requires much less than 
one CPU second, while for the case of a CAS reference space that gives rise to 
2804 internals the coupling coefficients require 12.5 seconds. These very recent 
developments means that the rate-limiting step in an MRCI calculation is in the 
calculation of the eigenvector, and the excellent performance of the new approach 
allows very large reference spaces: the calculation of the coupling coefficients for 
more than 60 000 internals requires about 700 seconds on the X-MP/48. 

G. Multireference Cl eigenvalue determination 

The FCI calculations described above show that it is possible to achieve 
very high performance in the eigenvalue determination through the formulation of 
the time-consuming step in terms of matrix multiplication. The factorization into 
internal and external contributions in the multireference direct Cl (MRCI) case, 
with the consequent association of a set of CSFs and an array of Cl coefficients 
with each internal occupation, is also very well suited to a matrix multiplication 
formulation [81,84-86], and so again very high performance should be possible. 

On the other hand, there are substantial differences between the FCI and MRCI 
cases, resulting mainly from the fact that all orbitals in the FCI calculation are 
classified as part of the same orbital space and their number is so small that all 
integrals can be held in memory, while in the MRCI calculations there is a dis- 
tinction between occupied and secondary orbitals, and it is seldom possible to 
hold all the integrals in memory. For MRCI calculations, additional data-handling 
involving sorting both integrals and coupling coefficients must thus be performed. 
If are active or inactive orbitals and a, 6, c... are secondary orbitals, the 

types of interactions between internal occupations can be classified by the type of 
integrals required: (i|/i|j), ( « | /z | z ) , (ci\h\b), ( ij\kl ), (ai\jk), ( ab\ij ), (ai\bj), ( ab\ci ), 
and (ab\cd). Each class of integrals contributes to a limited number of types of 
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interaction. Therefore the determination of the eigenvalue first involves sorting 
the integrals into classes, and to a given order within classes. The formula file is 
sorted during its generation to have the same organization of types as the integral 
file. In each MRCI iteration the program loops over each type of interaction and 
within each type a block of coupling coefficients is read. If these coupling coeffi- 
cients refer to a new block of integrals, these are also read in. In our organization, 
the formulas and integrals are ordered so that the integral and formula files are 
read only once per iteration, and of the order of N^q integrals are required to be 
in memory at one time, where Nmo is the number of MOs. 

Each possible interaction is treated in a subroutine specific to the inte- 
gral class: the details of the treatment have been extensively reviewed by Saun- 
ders and van Lenthe [86], and we will not repeat them here. As shown in Ref. 86, 
most types of interaction can be reduced to multiplying a block of integrals by 
coupling coefficients, and then a subsequent matrix multiplication of the results 
by a block of Cl coefficients. (In practice, the block of integrals may actually be 
a sum of two different blocks multiplied by two coupling coefficients.) The final 
matrix product is added to a block of the <r vector, so there is some saving of 
memory if the matrix multiplication routine allows the product to be added to 
(or subtracted from) the destination array. 

Symmetry is used in all sections of the Cl code, just as in the transfor- 
mation. This includes the index permutation symmetry of the integrals as well as 
the spatial symmetry. If P indexes a particular n — 2-electron internal, for exam- 
ple, the coefficients of the distinct double excitations generated by coupling to P 
can be written as an array 

c a P b V a> b. (IV. 16) 

Such arrays display the same permutational symmetry as one-electron integrals: if 
a and b are of the same symmetry type the array is triangular, while if they are of 
different symmetries the array will be square. Some contributions to a will then 
be required only as lower triangles, as the blocks of a have the same structure 
as those of c. In such cases we must square the block of integrals and Cl coeffi- 
cients, in the same manner as in the transformation, form the necessary product 
and then fold the two halves of the product matrix together to add to the cr vec- 
tor. These squaring and folding operations are vectorizable, as discussed in IVD 
above, and take only a small amount of time. Therefore, while an MRCI calcula- 
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tion involves a far more complex organization than the FCI scheme discussed in 
IVE, each individual step is vectorizable and very high performance can be ob- 
tained. We have been able to routinely perform calculations involving 1 million 
CSFs and have on occasion performed calculations involving more than 4 million 
CSFs. 

Some timing results are presented in Table 10. For N 2 with the 10 va- 
lence electrons correlated and an SCF reference the direct Cl iteration time is less 
than 1 second on the X-MP/48. This calculation involves 17626 CSFs. The ob- 
served performance is about 44 MFLOPS: lowering the symmetry to C* 2 V dou- 
bles the iteration time but also improves the performance to 66 MFLOPS. In the 
limit of no symmetry about 140 MFLOPS would be obtained. Hence for very 
large basis sets, in which the overheads have become negligible, we may expect 
single reference direct Cl performance of about 140 MFLOPS on the X-MP/48. 

In multireference cases, the performance is somewhat higher. For example, tim- 
ing for an N 2 calculation with a CAS reference space (six active electrons in six 
active orbitals, giving 32 reference CSFs) is also given in Table 10. The iteration 
time of about 80 seconds for almost 730 000 CSFs corresponds to an average of 
50 MFLOPS. Hence in very large (or low symmetry) cases we may expect MRCI 
performance of better than 150 MFLOPS. A breakdown of the that part of the 
iteration time concerned with the three most time-consuming classes of integrals 
processed is also given in Table 10. The (ai\bj) integrals (this actually includes 
(ab\ij) integrals as well) require the most effort, followed in this case by {ai\jh) 
and (a6|cd). For very large basis sets the latter are expected to dominate, but 
this case is rarely observed in practice. Finally, we note that on the CRAY-2 the 
observed times are up to twice as long as on the X-MP/48: these N 2 calculations 
involve multiplication of rather small matrices and the matrix multiplication per- 
formance on the CRAY-2 will suffer somewhat. On the other hand, the iteration 
times on the Y-MP are about 2/3 of those on the X-MP, suggesting that large 
MRCI calculations will run at well over 200 MFLOPS on the Y-MP. 

H. CASSCF calculations 

The CASSCF approach can account for near- degeneracy correlation ef- 
fects and correlation contributions that vary rapidly with geometry [87]. It can 
thus supply an excellent zeroth-order description for use in the MRCI approach. 
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We employ a two-step, uncoupled MCSCF optimization scheme [88], which in- 
volves a transformation of the integrals, determination of an FCI wave func- 
tion, construction of a gradient vector and Hessian matrix for the energy depen- 
dence on orbital rotations, and solution of the simultaneous linear equations of 
the Newton- Raphson method for obtaining improved orbitals. In practice, far 
from convergence it is better to use a first-order approach for optimizing the or- 
bitals [89,90], in which case an approximate Hamiltonian matrix over single ex- 
citations must be constructed instead of the Hessian; this Hamiltonian matrix is 
then diagonalized in order to obtain improved orbitals. We now consider each of 
these steps in more detail. 

Just as for the MRCI case discussed in the previous subsection, different 
classes of integrals are required in a CASSCF calculation. For the Cl step, only 
integrals with four active orbital indices are required (the inactive orbital con- 
tributions can be absorbed into the one-electron operator, as discussed in IVD 
above), while for the gradient of the energy with respect to orbital rotations, 
integrals with three active orbital indices and one index that runs over the full 
MO space are required. Finalfy, for the orbital rotation Hessian, integrals with 
two indices running over the full space and two over the space of occupied (in- 
active and active) orbitals are required. We compute all integrals required for a 
given MCSCF iteration in the same transformation step. The two-electron inte- 
gral transformation is performed in a manner very similar to the full two-electron 
transformation described above in IVD, but there are some special features added 
because only certain classes of transformed integrals are required. For example, 
if m, n... denote occupied MOs and p, q... all MOs, it is clearly advantageous to 
obtain integrals (pq\mn) and (prn\qn) by first transforming integrals over in- 
dex n and then over q. In this way no step in the transformation scales worse 
than MiV 4 , for N AOs and M occupied MOs, an N/M - fold reduction over the 
full transformation. Further, the ranges used in the second half-transformation 
can obviously be modified depending on which MO indices appear in the half- 
transformed integrals. Some care is needed in this approach when symmetry is 
used. Thus if the integrals (p a <?a| m /? n /j) are required, where a and /? label irre- 
ducible representations, it would be necessary to perform a full transformation on 
the a symmetry indices at the outset if the aa(3(3 AO integral block is to be pro- 
cessed only once. Instead, following Roos [91], this block is processed twice in the 
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first half-transformation, once as aaflfi and once as /3f3aa. In this way the first 
transformation step always scales as MN 4 . The same strategy is employed for 
the a(3 'yS blocks as well. 

The FCI calculation is handled using the factorized coupling coefficient 
approach of Siegbahn [71], as described in section IVE. The construction of the 
orbital Hessian requires a very small amount of time and in our current version 
is still scalar code. The solution of the simultaneous equations is performed us- 
ing an iterative scheme analogous to the Davidson diagonalization [69]. As we are 
normally able to store the full orbital Hessian in memory, the matrix-vector prod- 
uct needed in the iterations can be performed one row at a time, it is then iden- 
tical to the construction of the Fock matrix (eqn IV. 7 above) and is implemented 
as a SAXPY and a dot product. (The handling of the approximate Hamiltonian 
matrix appearing in the first-order optimization scheme is very similar.) Note 
that for very large cases (or in the absence of symmetry) this step could also pro- 
ceed as for the Fock matrix construction by storing the ordered rows on disk and 
re-reading them in each iteration. If necessary, the same compression techniques 
as used for the SCF supermatrices could be employed, together with a sparse 
SAXPY and dot product. 

The time-consuming steps in a CASSCF calculation are thus vectoriz- 
able, and using these techniques we have been able to perform calculations involv- 
ing about 40 000 CSFs and a basis set of about 200 functions. For the N 2 exam- 
ple we have used in these discussions the CASSCF iteration time is dominated by 
the integral transformation (which takes only a few seconds on the X-MP/48), 
the orbital optimization step requires one second and the Cl time is negligible 
(and is anyway almost entirely overhead for this case — 32 CSFs). For a configu- 
ration space of some 3500 CSFs the Cl step requires about five seconds per direct 
Cl iteration, while spaces of 40 000 CSFs would require about 80 seconds on the 
X-MP. 


I. Avoiding I/O, direct methods 

One of the constraints on how efficiently calculations can be performed, 
as we have repeatedly noted, is the rate at which data can be retrieved from sec- 
ondary storage. SSDs provide a partial answer to this problem for X-MP (and 
Y-MP) machines, but these devices are not available for the CRAY-2. However, 



the large central memory provided on the CRAY-2 and planned for the CRAY— 3 
provides a possible alternative: that of memory-resident data sets. One obvious 
technique is to generate the integral list in memory for use in subsequent SCF 
steps. With the use of high symmetries (generating only symmetry-distinct inte- 
grals) and pre-screening techniques to eliminate small integral batches, it is pos- 
sible to perform calculations with 500 basis functions or more in 150 MW or less 
on the CRAY-2 [28,92]. It is advantageous to avoid storing labels with the inte- 
grals by processing the list using the same loop structure as was used to generate 
them: in the first “SCF iteration” the non-vanishing distinct integrals are com- 
puted, stored and used in the Fock matrix build, and a flag is set in an index list 
to indicate that a particular batch was computed; in subsequent iterations the 
same loops are executed, skipping all processing if the batch was not computed 
and simply retrieving the batch from memory for the Fock matrix build if it was. 
While there is some overhead associated with the repeated execution of the outer 
loops of the integral generation code, considerable storage is saved by eliminating 
labels. As the flag for each batch can be represented by a single bit the index list 
requires almost no space. The whole scheme requires little more CPU time than 
a conventional SCF scheme, but eliminates almost all I/O processing. This ap- 
proach has also been used to reduce the I/O associated with the perturbed SCF 
and CASSCF wave function generation in the ABACUS analytic derivative pro- 
gram [93,94]. 

A more drastic approach to the elimination of I/O is not to store large 
data sets (such as the AO integrals) but to recompute them as required. This is 
the philosophy behind the “direct SCF” scheme of Almlof and co-workers [47,95]. 
Viewed naively, this appears to represent a trade-off between the CPU time re- 
quired to generate the integrals and the I/O overhead (and storage requirements) 
associated with repeatedly retrieving them from secondary storage. However, it 
is perfectly possible for machines like the X-MP or CRAY-2 to generate the inte- 
grals over basis sets of 1500 functions or more in reasonable CPU times (say, on 
the order of hours); that is, to generate an integral list far larger than the capac- 
ity of most supercomputer secondary storage systems. In such a case the ques- 
tion of a trade-off does not even arise, as the conventional approach would not 
be feasible. Of course, since the integrals must be regenerated in each iteration, 
it is desirable to minimize the number of iterations and to eliminate as many in- 
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tegrals as possible, as well as using the most efficient scheme possible for com- 
puting the integrals. These aspects, together with timings, have been discussed 
elsewhere [47,95,96], as has an interesting hybrid approach with explicit storage of 
some integrals [97]. 

The extension of this direct approach to MCSCF and Cl calculations has 
also been discussed, but only in purely formal terms without computational im- 
plementation [98]. Very recently, however, Saebp and Almlof [99] have developed 
a program for computing the MP2 correlation energy using a direct approach. 
Such a scheme can be regarded as a first step towards implementing a Cl method, 
as well as generating results which are very useful in themselves. The MP2 en- 
ergy requires MO integrals of the form (ia\jb), which are most efficiently obtained 
not by transforming the charge distributions, as described above, but by trans- 
forming the first and third indices as the first pair, then the second and fourth. 

It is obvious that this requires an integral list that is four times longer than the 
“canonical” list, that is, double the length required for the conventional transfor- 
mation. In a direct approach to MP2, then, about four times as many integrals 
must be calculated as in an SCF iteration, so it would be expected that the MP2 
energy would require about four times the CPU time of a direct SCF iteration, 
assuming that integral generation is the dominant step. This is essentially what is 
observed in practice by Saebq and Almlof [99]. A similar approach has also been 
investigated by Head-Gordon et al. [100]. 

J. The influence of supercomputers on quantum chemistry. 

It is important to conclude this presentation of supercomputer imple- 
mentations with a discussion on how these algorithms and machines have influ- 
enced our approach to quantum chemistry. While it might be thought that per- 
formance gains of an order of magnitude or more would simply increase the size 
of systems considered, their influence is much more profound. For example, the 
ability to perform full Cl calculations in realistic basis sets has provided us with 
detailed benchmarks for other correlation methods [67,68]. These have shown 
that multireference Cl wave functions (with some correction for size-consistency 
effects if eight or more electrons are correlated) reproduce the full Cl results to 
very high accuracy. Since MRCI wave functions are thus an adequate solution to 
the correlation problem, we may infer that (assuming relativistic effects or Born- 
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Oppenheimer breakdown terms are negligible) any discrepancies between MRCI 
results and experimental observations are due to inadequacies in the atomic or- 
bital basis. While this had been hypothesized on a number of occasions [101-103], 
full Cl calculations (being more practical than the use of a complete orbital ba- 
sis) were required to confirm it. It then becomes worthwhile exploring the pos- 
sibility of using better basis sets together with MRCI wave functions, and su- 
percomputers provide the necessary computational resources to allow the use of 
atomic natural orbital basis sets [51], which have led to almost chemical accuracy 
(1 kcal/mol) for systems such as N 2 [104], and to even higher accuracy for CH 2 
and SiH 2 [105]. 

Another way in which the power of supercomputers influences quantum 
chemistry is the speed with which results can be obtained. Even if results of a de- 
sired accuracy can be obtained on, say, a, VAX-type minicomputer, the real time 
required to obtain the results may be too long for the calculation to be useful. 

It is an important consequence of using a supercomputer that results can be ob- 
tained in relatively short times, and this factor will doubtless increase in impor- 
tance as supercomputer performance increases. 
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V. Dynamics 

A. General observations 

While much of the effort in computational quantum chemistry goes into 
implementing a few, rather well explored methods, this is much less true in scat- 
tering calculations. It is generally necessary to consider a wider range of possi- 
ble methods and their various implementations, and the optimum course is often 
much more application dependent. We shall review here a greater variety of dif- 
ferent methodologies than was the case for quantum chemistry, but there is still 
no intention to review the entire field; we concentrate on approaches that we have 
explored and used. 

B. Classical dynamics 

The simulation of collisions by the quasi-classical trajectory method can 
be broken down into three steps: the specification of the initial conditions for the 
trajectories, the integration of the equations of motion to determine the final con- 
ditions, and the analysis of the the final conditions to extract rate data [4,5]. The 
initial coordinates and momenta are determined from quantities which fall into 
two categories, namely those which are fixed, like the total energy or initial quan- 
tum state, and quantities such as orientations, which are not experimentally re- 
solved. The unresolved quantities must be averaged over, and this necessitates 
the determination of many different trajectories. 

Computationally, the most expensive step is the integration of the trajec- 
tories. It is therefore advantageous to consider the savings possible by vectorizing 
this step. The problem is to determine the coordinates q t and their conjugate mo- 
menta pi at some time after the collision, given values before the collision. These 
are determined by solving Hamilton’s equations: 


dpi/clt = 

—dH/dqi, 

(V.l) 

dqi/dt — 

■■ dH/dpi. 

(V.2) 


Here H is the Hamiltonian and i runs from 1 to the number of degrees of freedom 
in the system. Thus once the center of mass motion is removed, Hamilton’s equa- 
tions comprise a set of 6iV — 6 coupled first-order differential equations, where N 
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is the number of atoms. In the absence of external fields, the derivatives of the 
Hamiltonian become dH/dqi = dV/dqi and dH/dpi = dT/dpi , where V is the 
potential energy of the system and T is the kinetic energy. 

Many different algorithms have been used to solve Hamilton’s equations, 
however in most methods the time-consuming steps consist of forming the quanti- 
ties fj , the vector of time derivatives of the pi and qi at intermediate step j, and 
then using them to predict new coordinates and momenta. For later convenience 
let y n be the vector of pi and qi at time step n. The operations for predicting 
new coordinates and momenta are typically SAXPY-like operations which can be 
evaluated reasonably efficiently, as demonstrated in previous sections, provided 
the vector lengths are great enough. If a single trajectory is being integrated, 
then the vector lengths would be 6N — 6, or 12 for A+BC collisions and 18 for 
AB + CD collisions. These lengths are insufficient to give execution rates which 
approach the ultimate capabilities of vector pipelined machines. Further, the vec- 
tor lengths involved in the calculation of the gradients of the potential which con- 
tributes to fj will be even less. Vectorizing a single trajectory is thus not a very 
efficient use of Cray computers. 

It is fortunate that it is possible to do much better by taking advantage 
of the fact that many trajectories are required. Because the integration of each 
trajectory is independent of all others, several can be processed simultaneously. 
That is, new vectors Y n and Fj can be constructed by simply concatenating the 
vectors y n and fj from different trajectories; these new longer vectors can then be 
used in the predictions of new coordinates and momenta. The vector lengths in 
these steps can thus be made equal to (6iV — Q)N tra j, where N tra j is the number 
of trajectories integrated simultaneously. This means that with little difficulty the 
asymptotic rates can be reached for the operations that are performed. However, 
the integration is not the entire calculation, and several other steps need to be 
considered before predictions of overall execution rate can be made. 

We note first that although the integration of each trajectory is indepen- 
dent of the others, the initial conditions are not. This is because the random sam- 
pling used to generate the initial conditions relies on pseudo-random number gen- 
erators which require knowledge of one or more previous pseudo-random numbers 
in order to generate the next in the sequence. Thus in general this initialization 
part of the calculation must be performed in scalar mode. Fortunately, it is of 
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sufficiently small size that the scalar operations do not contribute significantly to 
the overall run time. 

An additional possible complication is that if variable step size algo- 
rithms are used to integrate Hamilton’s equations, then operations which are 
more dependent on the specific trajectory are required to predict the coordinates 
and momenta. This would disrupt the vector processing discussed above. Con- 
sequently it is necessary either to use a fixed step size algorithm, or to modify 
existing variable step size methods to treat the different trajectories in a more 
uniform manner. The optimum solution will be problem dependent, and in many 
cases the simplicity of a fixed step size algorithm will out-weigh the advantages of 
a variable step size algorithm. 

The final aspect to consider is the evaluation of the time derivatives /j, 
that is, the right hand sides of (V.l) and (V.2). For systems for which cartesian 
coordinates are used as the qi , the derivatives of T are simply mass factors times 
the pi and thus are relatively trivial to form compared with the other parts of the 
calculation. Thus the calculation of dpi/clt and dqijdt , given values of pi and qi 
is mainly spent evaluating the quantities dV/dqi. This evaluation can be further 
broken down into two steps. First of all, it is unlikely that the potential function 
will be known explicitly in terms of the integration coordinates qp. typically the 
interatomic distances might be used to express the potential but Cartesian coor- 
dinates for the qi . Thus it is necessary to first compute the gradients with respect 
to some set of internal coordinates Q n , and then to use the chain rule to obtain 
the final derivatives. Usually the manipulations required for the chain-rule calcu- 
lations are straightforward and inexpensive compared to the last remaining step, 
the evaluation of the gradient of the potential. For systems in which a realistic 
potential is used, this step will usually dominate all other steps. This is simply 
a manifestation of the complexity of a typical expression for the gradients of the 
potential compared to the expressions for the integration algorithms. 

For relatively simple potentials, it can be advantageous to optimize the 
various operations taken by the integrator more carefully — see Ref. 106 for an 
example of how this can be done. 

The operations required to generate the gradient of the potential depend 
on the representation of the potential: typically, mathematical functions like ex- 
ponential and square root, and various trigonometric functions are required, as 
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well as the usual arithmetic operations. In addition, functions requiring table 
lookups, such as splines, are sometimes utilized, thus special effort is required to 
produce an efficient code. However, in contrast to procedures which rely heav- 
ily on matrix multiplication, the maximum execution rate which is ultimately 
achievable is usually well below the theoretical hardware limits. This arises for 
two reasons. First, the expressions for the gradients often contain common fac- 
tors which are the result of single operations: these cannot be chained with other 
operations and such steps will produce a maximum of a single result per clock pe- 
riod. Second, the possibilities for minimizing memory traffic by unrolling loops or 
by using several vectors (which remain in the vector registers) for more than one 
operation are limited. This is again due to the complexity of the expressions for 
the gradients and the many different vectors involved. Thus while all of the time- 
consuming steps for classical dynamics calculations can be vectorized, the vector 
speedups obtainable are limited because of the relatively inefficient mapping to 
the hardware. 

Perhaps the best way to improve the performance of the construction of 
the gradients of the potential is simply to minimize the number of evaluations re- 
quired that is, to use an efficient variable step size integrator. However, this can 
cause inefficiencies elsewhere, because of vectorization difficulties, and the various 
parts of the calculation must be judiciously balanced for overall efficiency. One 
example where the extra work involved in a variable step size algorithm paid off 
was in calculations on the recombination of hydrogen atoms [107]. Here one H 2 
molecule will have an energy near or above the dissociation limit, and at certain 
times this energy can be mostly kinetic energy, in which case small time steps will 
be required, while at other times the energy will be mostly potential energy and 
the slow velocities would allow larger time steps. Since most of the time will be 
spent in regions of configuration space where the energy of the diatomic is mostly 
potential energy, a variable step size algorithm can provide significant savings 
over a fixed step size algorithm. The variable step size algorithm implemented for 
this problem was a modification of the Bulirsch-Stoer method [108]. This method 
has been shown to be an efficient choice for the scalar integration of trajectories 
[4], and has the additional advantage that minor modifications allow the vector- 
ization of most of the operations involved in the integration. We only outline the 
ideas here: full details are given in Ref. 107. The basic philosophy of the method 


59 



is that to propagate the solution over some time increment, a series of integra- 
tions using a low-order method and smaller and smaller step sizes is performed. 
The results of each individual integration will not necessarily be very accurate, 
but accurate results can be obtained by extrapolating the results of the different 
step sizes. Based on the number of integrations required to obtain an accurate 
extrapolation, the time increment to be used next is predicted, that is, the step 
size is adjusted. In typical scalar implementations of this method, the number of 
low-order integrations per time increment is increased until the extrapolation con- 
verges within some tolerance. However, a similar scheme would not be very fea- 
sible for the integration of several trajectories simultaneously. Thus the method 
was modified to perform a fixed number of low-order integrations per time incre- 
ment. Then, based on the errors in the extrapolations using the different num- 
bers of integrations, the results for that time increment are either discarded and 
the time increment for that trajectory decreased, or the results are saved and the 
time increment adjusted to reach some error goal. Thus the scalar operations for 
the variable step size part of the code consist only of error checking, saving or dis- 
carding the results and adjusting the time increment. These steps amount to such 
a small fraction of the overall process that this algorithm performs very efficiently 
on the Cray machines. 

As an illustration of the discussion above, we offer the times given in 
Table 11. Here we show timings using the variable step size Bulirsch-Stoer inte- 
grator for two systems. The first system is F -f H 2 using the simple Muckerman 
No. 5 LEPS potential [109]. This system was chosen to give an idea of the limit 
of a three-body system with a simple potential. The second system is H 2 + H 2 
using the accurate potential from Ref. 107 which is based on extensive ab initio 
electronic structure calculations, and represents the extreme case of a complicated 
potential for a system with many degrees of freedom. Both calculations use in- 
teratomic distances as the internal coordinates Q n with respect to which the po- 
tential gradients are directly calculated, and Cartesian coordinates having the 
center of mass stationary at the origin as integration coordinates [107]. The times 
are normalized so that the integration time is 1 unit on the X-MP/48, and the 
MFLOPS rates are determined from the hardware performance monitor on the 
X-MP/48 and by scaling by the appropriate times for the other machines. The 
number of simultaneously integrated trajectories ( N tr aj ) was 500. 
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First, consider the F + H 2 system. Even when using the very simple po- 
tential employed here, a considerable fraction of the total time is spent evaluating 
the gradients of the potential with respect to the integration coordinates. The 
amount of time spent in the integration routine amounts to only 36% to 51%, de- 
pending on the machine. Thus in spite of the modest execution rate of the vari- 
able step size part of the code, a reasonable overall rate is achieved. Of the time 
spent getting the gradients, about 20-27% is spent calculating the interatomic 
distances and transforming the gradients to the integration coordinates. Consider 
now the H 2 + H 2 system. Here the complexity of the potential is obvious — 94% 
of the time is spent evaluating the gradients of the potential with respect to the 
internal coordinates in spite of the fast execution rates of 100 - 190 MFLOPS. 
Here since only 3% of the time is spent in the integrator, considerable flexibil- 
ity is available in optimizing the variable step part of the code by introducing 
more scalar operations, without significantly degrading performance. The rela- 
tive CPU time per integration step is about a factor of 20 greater for H 2 + H 2 
than for F + H 2 . 

The overall execution rates for either of the two potentials on the vari- 
ous machines range from 100 to 190 MFLOPS, with the Y-MP at the top of this 
range and all the other machines clustered at the bottom. These rates are well 
below the theoretical hardware limits, which is probably a reflection of the limited 
optimization ability of current compilers when faced with the complicated expres- 
sions present in these codes. 

At this juncture, we discuss other resources required by the classical dy- 
namics calculations, namely disk and memory. As a rule, the variables required 
for the various operations required to propagate a trajectory are of sufficiently 
small number that they all can be held in memory, thus very few 1/ O operations 
are required during the integration of a trajectory. An exception to this is if the 
coordinates and momenta along the trajectories are desired for analysis purposes, 
such as plotting. Then it will be necessary to perform a certain amount of I/O 
in order to save these quantities for later use. However, the majority of trajecto- 
ries will not be so analyzed, and thus it is only necessary to save the initial and 
final conditions. The memory allocated by our program has not been minimized, 
mainly because we have not encountered problems obtaining the space desired. 
Much of the space is taken up by temporary vectors, which a more sophisticated 
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compiler could allocate dynamically, thus reducing the overall space requirements. 
The present code used for H 2 + H 2 requires 720 x Nt ra j words for the integration 
part of the code and 589 x N tra j words for the potential gradient part of the code. 
Thus when using N tra j = 500, which is our usual production value, the code re- 
quires less than one million words of memory. 

This section will be closed with some ideas on how the timings for trajec- 
tories could be improved. Since most of the time is spent on evaluating the gra- 
dients of the potential, we will concentrate on this aspect of the problem. With 
the recent availability of ab initio gradient methods, it would seem advantageous 
to directly fit the gradients of the potential. Then one would be evaluating dif- 
ferent functions for the various gradients rather than differentiating a single func- 
tion. Although this procedure has many conceptual merits, it has several possible 
problems that limit its usefulness. First of all, the fitted gradients will probably 
not all integrate to the same function, that is, the potential will be nonconserva- 
tive [110] and so the trajectories will not conserve energy. However, in practice 
this may not be a significant problem. Another problem is that often both the 
potential and the gradients are required. For instance, it is often desirable to cal- 
ibrate classical methods by comparing to quantum mechanical solutions of the 
problem, and quantum mechanical dynamics methods require the potential it- 
self rather than the gradients. In addition, hybrid dynamical methods exist [111] 
which treat some degrees of freedom using classical mechanics and some using 
quantum mechanics. Another facet of this reflects the large amount of labour 
usually involved in constructing a fit to the potential: it not uncommon for sev- 
eral different studies using different methods to be performed using a potential 
once it has been fitted. It is thus less restrictive if the potential is given rather 
than the gradients. Finally it may simply not be as efficient to evaluate separate 
fits to the gradients as it is to explicitly differentiate a function. This is because 
there may be many common expressions in the differentiation formulas. For ex- 
ample, if we make the assumption that the gradients will be fitted by a function 
of similar complexity as the potential, then a comparison of the number of gradi- 
ents times the time to evaluate the potential to the time to evaluate the poten- 
tial gradients will give some idea of the efficiency of the method. For the simple 
F + H 2 potential used in the timings above, the time to generate the potential 
and three gradients is only 1.3 times as long as just generating the potential, and 


62 



for the H 2 + H 2 potential of Ref. 107, the time to generate the potential and six 
gradients is only 2.3 times the time for just generating the potential alone. Thus 
for these cases, it is more efficient to deal with the potential and then explicitly 
differentiate it. 

B. Quantum dynamics 

In the differential equation approach to quantum mechanical dynam- 
ics calculations of nonreactive molecular collisions, the equations to be solved 
are [112] 

^lf(r-) - D(r)f(r) = 0, (V.3) 

where r is the distance between the centers of mass of the target and projectile, 
the f are the unknown radial functions and D is the coupling matrix. The matrix 
elements of the coupling matrix are given by 

D nn , = p < n|V* n< | n' > +S n „, [~k 2 n + l n (l n + 1 )/r 2 ], (V.4) 

where < n\V int \n' > is the matrix element of the interaction potential between 
the basis functions labeled by the indices n and n 1 , fi is the reduced mass for the 
collision partners, k ^ is the square of the wave vector for channel n: 

kl = 2 n(E - e n )/h 2 , (V.5) 

E is the total energy, e n is the internal energy for channel n, and l n is the orbital 
angular momentum quantum number for relative translational motion for channel 
n. The basis functions labeled by n describe all degrees of freedom except r and 
each value of n defines what is known as a channel. The interaction potential is 
defined as that part of the potential which goes to zero as r goes to infinity, thus 
it does not include the binding potential of the collision fragments. 

The boundary conditions for the unknown functions are 

/nn'(0) = 0, (V.6o) 

;„f )] - Snn' exp [i(k n r - l n f )]}, k 2 n > 0; 
Snn r exp[ ^ 

(V.66) 


and as r goes to oo, 

^ ( (2 ik n )~^ { S nn > exp[-i(k n r - 
\ i(2\k n \)-% {8 nn > exp[|& n |r] - 
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From the matrix elements of the scattering matrix, S nn > , experimental observ- 
ables can be calculated using standard formulas [113-115]. 

To determine the scattering matrix, we convert the problem of determin- 
ing the f from a boundary value problem to an initial value problem by determin- 
ing a linearly independent set of solutions of (V.3), called f which are regular at 
the origin but have arbitrary slopes there [116]. In practice it is not usually neces- 
sary to start at the origin but rather at any larger distance where the hard core of 
the potential is sufficiently repulsive so that the radial wave functions are negligi- 
bly different from zero. At large r these functions behave as 


fr 


{ 


Pnn' sin(fc n r l n 2 ) T Qnn 1 COS (& n r l n 2 )} k n 6 ? 


P nn' exp[|fc„|r] + Qnn' exp[— |Ar n |r], 


kl < 0 , 


(V.7) 


with P, P, Q, and Q determined from f at two points or the logarithmic deriva- 
tive at one point. From P and Q, the scattering matrix can be easily determined. 

The number of algorithms for solving the close-coupling equations (V.3) 
is quite large [117-121], and the optimum algorithm is case dependent. The num- 
ber of coupled equations, the dimension of D in (V.3), will be called N and can 
range from 1 for central potential problems to over a thousand for anisotropic 
AB + CD collisions [122], thus the vectorization strategies will depend on the size 
of the problem. In contrast to problems in classical mechanics, the integration of 
(V.3) for a single channel (N — 1) can be vectorized relatively effectively. This is 
because the interaction potential depends only on the parameter r, which can be 
made to take on known values while in classical problems the potential depends 
on the unknown functions which are being determined. Thus the steps required 
for integrating the close- coupling equations for a single channel typically would be 
to determine D n (r), to determine temporary quantities depending on ffn(r) for 
single values of r, and finally to recursively assemble the solution. The first two 
steps are completely vectorizable, with vector components corresponding to dif- 
ferent values of r, while the final step is not; however, if properly coded this final 
step will not involve many operations per r value and hence will not be the rate- 
determining step. For example, in Ref. 123, scattering calculations were carried 
out at complex energy searching for poles in the scattering matrix. This required 
calculations at many different energies. The algorithm used was a modification of 
the R — matrix propagation algorithm [124-126] where the equations defining the 
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modified method are 


= o, 

(V.8) 

R? = i w /(J^ i_1) +? (i) ), 

(V.9) 

fiiMii/dr\ r=rt+hl/i ) = ~Rf + r<°, 

( v - 10 ) 

= -[ffr 2 , 

(V.ll) 

(/•> = rj'* + rj'"" 1 ', 

(V.12) 
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II 

(V.13) 

P = cosh(A (i) /i (,) ), 

(V.14) 

P 3 (i) = A (f) sinh(A (i) ft (i) ), 

(V.15) 

A« 2 = D„(r i ). 

(V.16) 


In the above equations, is the width of sector i and is the center of sector 
i. Equations (V.8) and (V.10) need be evaluated only once per calculation, Eq. 
(V.9) is the recursive step and the other equations are vectorizable with vector 
lengths equal to the number of sectors, which is typically on the order of hun- 
dreds. Thus significant vector speedups are obtainable. It should be noted that 
as in the case of classical dynamics, the vectorized steps here do not map as well 
to the Cray architecture as operations such as matrix multiplication do, so that 
extremely high execution rates are not possible. 

For calculations with large IV, it will be more efficient to vectorize the 
operations required for the individual integration steps. The majority of the time 
will be spent on standard matrix manipulations on matrices of order N by IV. 

For large IV, these operations will dominate the calculation because they scale as 
IV 3 . Since (depending on the algorithm) various numbers of matrix multiplica- 
tions, matrix inversions, linear equation solutions, and matrix diagonalizations 
will be required per integration step, the choice of a particular algorithm will be 
based on the relative work of the various operations, that is, the coefficient mul- 
tiplying N 3 , and also on the execution rate of the operations, which introduces 
another coefficient multiplying the operation count. We now consider these last 
points in detail. 
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In Table 12, we quote estimates of the asymptotic execution rates and 
matrix order required to achieve half the asymptotic rate for several matrix op- 
erations. The operation count for large enough N will scale as C X N Z , with C x 
a coefficient depending on the operation. For matrix multiplication we take 
Cmxm — 2, for general linear equation solution we use Cgls — S/3 (2iV 3 /3 for 
LU decomposition, and 2iV 2 per right hand side for forward elimination and back 
substitution), for symmetric linear equations we use half this value: Csls — 4/3. 
Finally for real symmetric matrix diagonalization, we estimate Crs = 10. These 
operation counts are based on the discussions of Ref. 127. The data in the table 
were determined by averaging the results obtained by fitting the CPU times and 
MFLOPS rates with matrices of order 63, 127, 255 and 511 to N 2 (t s + Ntoo ) or 
10~ 6 C x /(too + ts/N) by least squares. These particular orders where chosen to 
be close to multiples of 64 to make most efficient use of a Cray computer’s 64- 
element vector registers while at the same time avoiding multiples of 2, which can 
cause catastrophic bank conflicts as discussed in Section III above. For example, 
the execution rate for 512 is a factor of 20 less than for 511 on the CRAY— 2 when 
using the general linear equation routine LUSOLV. We should also point out that 
by fitting only these above array dimensions we obtain different n 1 / 2 values from 
those in section III above. In particular, the fit includes only array dimensions 
considerably larger than the true n\j 2 values for matrix multiplication. The fitted 
xi\ / 2 results for this case are too small and should not be taken literally. 

For matrix multiplication, we give results for the Cray routine MXM. For 
diagonalization, we only quote results for the EISPACK [57] routine RS, which we 
have found most convenient and reliable for our scattering calculations, although 
other routines may be more efficient [128]. In the following comparison we will 
emphasize asymptotic execution rates (Voo), but for finite matrices the riij 2 values 
are also important parameters. 

For linear equations, we give results for four different routines. The 
first, called LUSOLV, is a FORTRAN code utilizing loops unrolled to a depth 
of 16 [14,129,130]. The next is the Cray library routine MINV [22]. The last two 
are LINPACK [131] routines, one for a general matrix and one for symmetric ma- 
trices. Comparing the four methods for linear equation solution, we see that it is 
never advantageous from a CPU time point of view to use the specialized sym- 
metric matrix routine, for its execution rate is always much more than a factor of 
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two less than the other routines. While it uses less memory, since only a triangle 
of the matrix needs be stored, the execution rates are so poor that this is unlikely 
to be of sufficient motivation to use it. On the X-MP/48, the execution rates of 
the general UNPACK routine and the library routine MINV are similar, but the 
LUSOLV program is about a factor of two more efficient then those routines. On 
the Y-MP the situation is similar, with the LUSOLV program being the fastest, 
followed by the general LINPACK routine, and then MINV, with relative ratios 
1:1.2:2.5. On either CRAY-2 machine the situation changes, and the Cray library 
routine MINV is almost a factor of two faster then LUSOLV, and more than a 
factor of 2.5 times faster then the general LINPACK routine. Thus the most ef- 
ficient routine to use on the X-MP or the Y-MP is the FORTRAN program LU- 
SOLV, whereas on the CRAY-2 the most efficient routine is the CRTs MINV. 

The relative efficiencies of the various operations can be estimated by 
comparing the values (see also Section III). Since matrix multiplication is usu- 
ally the most efficient operation, it is convenient to introduce the ratios TZrs and 
7 Zle which measure the asymptotic rates compared to matrix multiplication for 
diagonalization and the best method for linear equation solution. On the X-MP 
and Y-MP we have 7 ~ 1.2 and 'R-le = U while on the CRAY-2 these ra- 
tios are 2.6 and 0.91-1.0, depending on the memory speed. On the X-MP and 
Y-MP, therefore, these various matrix operations are approximately equally ef- 
ficient, while on the CRAY-2 the diagonalization code is much less efficient then 
the others. However on all Cray machines, the nj /2 values are much shorter for 
matrix multiplication than for the other operations, so for small to moderate ma- 
trices, the diagonalization and linear equation solution operations will be rela- 
tively less efficient then the above discussion indicates. We should again point out 
that these timings (especially CRAY-2 values) are subject to about 10% fluctua- 
tion, depending on system loads. 

Based on the timings reported above, it seems desirable to maximize the 
number of matrix multiplications and minimize the other operations in the overall 
scheme. One algorithm which does this, at least in principle, is the DeVogelaere 
method [132], which only requires matrix multiplications. However in practice, 
two aspects diminish the attractiveness of this method. First of all, the r step 
size is limited by the oscillation of the wave functions, that is, a certain number 
of integration steps will be required per De Broglie wavelength. Thus for long- 
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range potentials or high energy collisions, an extremely large number of integra- 
tion steps will be required. In contrast, a number of algorithms exist where the 
integration step size is controlled by the change in the interaction potential and 
thus large steps can be taken where the potential is slowly varying. The step sizes 
required for accurate results for these methods are only weakly dependent on to- 
tal energy also. 

The second difficulty with the DeVogelaere method is one common to all 
initial value methods that explicitly determine the radial functions f. At small r, 
all channels will be classically inaccessible, and thus the functions will be grow- 
ing exponentially. If the classically forbidden region is too large, then the compo- 
nent of the functions which has the largest exponential growth will become suf- 
ficiently larger than all other components that (to working precision) the linear 
independence of the initial conditions will be lost. The determination of the scat- 
tering matrix will then not be possible [132]. This can be controlled to a certain 
extent by periodically reorthogonalizing the columns of f [133], but this stabiliza- 
tion procedure introduces extra operations and a small integration step may be 
required to limit the exponential growth per step. In practice, for molecular col- 
lisions where it is necessary to include channels which are classically inaccessible 
for all r, this difficulty can be a severe problem. 

An alternative solution is to devise an algorithm which does not explic- 
itly determine the f but rather some other quantity which will be inherently sta- 
ble. For example, the logarithmic derivative matrix f -1 <if/dr will not suffer from 
stability problems. 

Because of these numerical difficulties, we have found the R — matrix 
propagation algorithm to be very attractive [124-126]. In this method, the in- 
tegration step size is usually limited by the r derivative of the coupling matrix 
D, so large step sizes are possible in the asymptotic region where the interaction 
potential is slowly varying, which is particularly advantageous for long-range po- 
tentials. In addition, the negative of the inverse of the logarithmic derivative ma- 
trix is propagated, which avoids all stability problems. A final advantage is that a 
large fraction of the work required is independent of total energy E so that effort 
can be saved by performing calculations for several energies. A disadvantage is 
that operations other than matrix multiplications are required. 

The operations involved per integration step for this algorithm are as 
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follows. The energy independent steps are to construct the coupling matrix D 
(which is real and symmetric) in sector i , to diagonalize it to determine the local 
adiabatic eigenvectors T l and eigenvalues [A^] 2 , and then to form the overlap 
matrix 

T(i-l,i) = [T (M) fT (i) . (V.17) 

For calculations at subsequent energies, it is only necessary to save the local adia- 
batic eigenvalues and overlap matrices. Thus for large TV, the relative time for the 
energy independent step will be JV 3 ( 2 -f 107?. ^s). Since the total energy enters in 
Eq. (V.4) only as a multiple of the unit matrix, the local adiabatic eigenvectors 
are independent of E and the eigenvalues are shifted by a constant amount if E 
changes. Then at each energy it is necessary to form 


Ri° = ri i) -p^]- 1 {Ri < “ 1 ) T(i-l,i)+r(i-l,Ori°}- 1 r(i-l ) «)[P? ) ]" 1 , (V.18) 


where 

Rf = -f [df/dr]- 1 | r=rj+fti/2> (V.19) 

r» = [p'Y’pf, (V.20) 

[P < 1 ‘ , ]„„' = 8 nn . coshfA^/t,), (V. 21 ) 

and 

= «„„•*{? sinh(A WAj). (V.22) 

Since P 3 J and rp are diagonal, the time-consuming operations required for 

(V.18) are one matrix multiplication and one linear equation solution, and so we 
see that the relative time for the energy dependent step is iV 3 ( 2 + 8/31Zle )• If 
we consider large scale calculations on the CRAY-2, where we take 7 Zrs as 2.6 
and 7Zlu as one, then we see that the ratio of the energy independent time to 
the energy dependent time is six, thus it is advantageous to perform calculations 
at many energies — performing calculations at seven energies would require only 
about twice as much CPU time as one energy. (Note that an earlier version of our 
code [130, 134] used a different sequence of matrix operations which was less effi- 
cient both in time and memory usage.) This assumes that the time for the evalu- 
ation of the matrix elements of the interaction potential required for (V.4) is neg- 
ligible. This is not always the case, even though the work to construct this matrix 
should scale as N 2 for large enough N . If the interaction potential time is not 
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negligible, then the ratio of the energy dependent times to the energy indepen- 
dent times will be even greater. Strategies for optimizing the interaction potential 
matrix evaluation will be discussed below. 

The storage requirements for the R —matrix propagation algorithm are 
relatively small, considering the amount of memory available on the CRAY-2. 

The matrices which need to be stored are of order N x TV , and the number re- 
quired is not large. For a single energy calculation, our present code requires 
seven matrices of this size, excluding the data required to form the interaction po- 
tential matrix. Usually we keep these matrices in memory, but some calculations 
carried out on a CRAY-1 used a version of the code which kept only two matrices 
in core and the remainder on disk. For the values of N we used in those calcula- 
tions (440 and 530), the I/O requirements of this strategy did not significantly de- 
grade the performance of the method. This would be especially true on a machine 
with a large SSD. Other quantities which need to be stored are those required to 
evaluate the potential coupling matrix. The requirements here may or may not 
be considerable, depending on the potential used. In our large scale HF -f HF 
calculations, using a complicated potential, our code required approximately an 
additional 40JV 2 words of storage for angular integrals and pointers (see below for 
a description of the construction of the potential matrix). Fortunately, this data 
is accessed sequentially, thus comparatively little is lost by storing it on secondary 
storage like disk or an SSD. 

If more than one energy is to be used, there are two storage choices. The 
first choice is to finish the calculation at the first energy before performing any 
part of the calculation for other energies. This requires that an additional N step 
matrices be saved (the sector overlap matrices), where N step is the number of in- 
tegration sectors used. The second choice is to perform the calculations for all 
energies at sector i before going on to the next sector. This requires 8 -f- Ne ma- 
trices, where Ne is the number of energies used [134]. Thus since N step is usually 
on the order of hundreds and Ne on the order of tens, it is usually more efficient 
to make the second choice. 

We now turn to the question of evaluating the interaction potential ma- 
trix. It is usually most efficient to transform to the body-fixed frame of reference 
to evaluate the interaction potential matrix. In this frame the kinetic energy part 
of D is no longer diagonal, but the coupling is very simple [135]. It is necessary 
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to back-transform to the laboratory frame only when the asymptotic analysis to 
determine the scattering matrix is carried out. For A -f- BC collisions we need to 
compute the body-frame matrix elements 


< 


vjQJMP\V int \v'j'njMP >=2tt J dR J dcosx Yf a (x,0)Q* vj (R)V int (r,R,x) 

*YMx,0)9*j'(R) 


(V.23) 


where v and v' are vibrational quantum numbers, j and j' the diatomic rotational 
angular momentum quantum numbers, J the total angular momentum, M is the 
projection of J on the laboratory -frame z axis, ft the projection of j, j' and J 
on the body-frame z axis, P is the parity, R is the diatomic bond length, r the 
distance between A and the center of mass of BC, x ^ ie an gl e between the vec- 
tors from A to the center of mass of BC and C to the center of mass of BC, Yjci 
is a spherical harmonic, and Q v j is a vibrational function. To arrive at (V.23) we 
have averaged over the Euler angles rotating an arbitrary laboratory-fixed coor- 
dinate system to the body fixed coordinate system: this produces a Kronecker 
delta for ft, J, M, and P, and the factor 2ir . The traditional way to proceed is 
to expand the cos x dependence of the interaction potential in terms of Legendre 
polynomials and then perform the angular integrals analytically. The R integral 
is then performed most efficiently using a optimized quadrature [136] so that the 
matrix elements become 


< vjttJMP\V int \v'j'Q JMP >= w vjv .j. ti v x (r, Ri)ff yx , (V.24) 

i,\ 

where w v j v >j^i is a vibrational quadrature weight, v\ is a potential expansion co- 
efficient, and fjtji is an angular integral (which is independent of M, J, and P). 

In general, it will also be necessary to generate the potential expansion coeffi- 
cients, and this is most straightforwardly done by projection: 

v\ = 1 J dcosx P\(™sxW int {r,R,x)- (V.25) 

The quadrature approximation to (V.25) is most efficiently evaluated as a matrix- 
vector product. Usually if the potential is given explicitly in terms of a Legendre 
expansion there are only a few terms in the A sum, however, if it is necessary to 
converge the angular expansion of a more general potential several tens of terms 
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are typically required. The difficulty of efficiently evaluating (V.24) arises for two 
reasons. First, the angular integrals are zero unless the triangle rule for j , j 1 , and 
A is satisfied, thus many integrals are zero and it is advantageous to exploit that. 
Second, the way in which two pieces of information depending only on a subset of 
the quantum numbers are combined together complicates efficient evaluation. 

We now consider three schemes to evaluate (V.23). In scheme A we eval- 
uate (V.25) by matrix-vector product and then form 

Vvjv'j', a(0 — ^ ^ w v j v < j > ,2 v\(r,Ri). (V.26) 

i 

This is performed as a matrix multiplication with vjv'j' labeling rows and A 
columns of the result. The inner index i is the number of points in the vibra- 
tional quadratures and usually is on the order of ten. Finally a sparse SAXPY 
is performed for each value of A to generate the matrix element. In scheme B the 
procedures of A are modified by forming the intermediate product 

A%,(r,Ri) = Y,vx(r,Ri)f?rx, (V.27) 

A 

which is a sparse SAXPY followed by 

< vjQ,JMP\V int \v'j'njMP Af r (r,Ri), (V.28) 


which is a vector plus vector times vector using scattered data. Finally in 
scheme C we dispense with the expansion of the potential and directly ^ form the 
Afj,( r ,Ri) via 


4 = 

n 

(V.29) 

where 


Cgj = c% V int (r, Ri, cos Xn), 

(V.30) 

and 


Cf n = V2™nYjSl(Xn,0), 

(V.31) 

w n a Gauss-Legendre quadrature weight. Equation (V.29) 

is evaluated as a ma- 


trix multiplication. The number of points in the angular quadrature is on the or- 
der of ten. 

t We are grateful to J. N. Murrell for suggesting this option. 
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The three schemes have various advantages and disadvantages. A draw- 
back of schemes A and B is that is it necessary to form the angular integrals 
fjjix-, while in scheme C it is necessary to form the Cp n , which is an easier task 
and can be done efficiently in vector mode. However, these steps are indepen- 
dent of r and R and so should not be of significant overall cost. In Scheme A 
the vibrational integral is computed very efficiently, but too much work is per- 
formed: many of the V v j v ij> f \ will not be required since many of the ff-, x are 
zero. Schemes B and C perform no more work than is required for the vibra- 
tional integrals, but (due to the scattering of data because a given vjv'j' pair 
contributes to more than one fl) this work will not be performed as efficiently in 
scheme A. 

Turning now to the angular part of the different methods, scheme A only 
performs the sparse SAXPYs once per A, while scheme B performs them once per 
A per vibrational quadrature point. Scheme C also needs to repeat the angular 
integrals for each vibrational quadrature point. However, if there were only one 
vibration quadrature point, scheme A would perform more work because the to- 
tal number of angular integrals (zero and nonzero) per A is proportional to the 
square of the number of vj states, whereas for schemes B and C the number is 
proportional to the square of the number of j states. Thus the relative efficiencies 
of schemes A and B will depend on particular circumstances, with a larger num- 
ber of v states coupled with a smaller number of vibrational quadrature points 
favoring scheme B and a small number of v states using many vibrational quadra- 
ture points favoring scheme A. In our experience, usually the efficiency perform- 
ing the vibrational integrals is the most important factor, thus scheme A is more 
efficient than scheme B. This is another example of how minimizing the operation 
count does not necessarily lead to the most efficient algorithm on vector comput- 
ers. 

In most cases, scheme B will be preferred over scheme C because the ma- 
trix multiplication to produce the angular integrals takes longer than the sparse 
SAXPYs. However, in special cases scheme C also has one potentially important 
advantage: it may be possible to greatly reduce the number of vj states used in 
the wave function expansion by using some sort of adiabatic rotational states 
rather than spherical harmonics [137]. That is, a few new functions Y n Sh which 
are linear combinations of many of the Y?q, could be defined and used instead. 
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This may allow otherwise intractable problems to be solved. A single left trans- 
formation of the Cf n is all that is required for scheme C, while the other schemes 
require similarity transformations for each value of A to produce new angular in- 
tegrals which now in general are not sparse, so the sparse SAXPYs are replaced 
with full SAXPYs. This will increase both the memory and the operation count 
for these schemes. The transformation times will be important because the trans- 
formations depend on r and hence must be carried out many times. 


Turning now to the case for AB + CD collisions, the matrix element to 
be found is [138] < v 1 v 2 jij 2 ji 2 ^JMP\V int \v , 1 v , 2 j , 1 j , 2 j[ 2 Q l JMP >, where Vi is the 
vibrational quantum number of molecule i, ji is the rotational quantum number 
of molecule «, j 12 is the quantum number of the vector sum of j\ and J 2 , J is the 
total angular momentum quantum number, M is its projection on the laboratory 
frame z axis, D is the projection of j 12 , J 5 an d J on b°dy fi xe d z axis, and P 
is the parity. The procedure here is very similar to the A + BC case, except that 
the manipulations are more complicated. Expanding the angular dependence of 
the potential in terms of combinations of spherical harmonics now involves three 
angles and three indices, thus the number of terms required is approximately the 
cube of that required for the atom-diatom case [139]. For anisotropic potentials, 
close to a thousand terms can be required to converge the angular representa- 
tion of the potential [140]. Thus although one may be tempted to ignore the pres- 
ence of zero angular integrals in the A + BC case to simplify vectorization, for 
the AB + CD case the number of integrals is so large that it is imperative that 
the zero integrals be identified and not stored. The aggregate number of angular 
quadrature points can also be in the thousands. The situation for the vibrational 
integrals is not quite as bad, because there are only two vibrational coordinates 
and so the number of quadrature points will be only the square of the A + BC 
case, say, 50 to 100 if optimized quadratures [136] are used. The three different 
schemes described above for performing the integrals can be applied here with 
little modification with the exception of scheme C which for Q greater than zero 
must be modified to include contributions from two numerical integrals, that is, 
two matrix multiplications are required, because of the more complicated coupling 
of the angular momentum vectors. Thus C^ n of (V.31) is replaced with the quan- 
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t; *y c h s lven b y 

t m i +l m i I 

C?*j,h,n =2[w«5 ni w„ 2 ui„3]5 2J (jim 1 j 2 m 2 \jij 2 ji 2 ^W jl \ mi \(-'i-) 2 

mi m2 

^ 2 | m 2 |(-l)'^^r i kcosXnJPjr 2 l (cosXn 2 >[^(m 2 - mi)] 

(V.32) 

where x is either sin or cos, n is a composite index denoting the three quadra- 
ture indices n\, ri 2 and ns, w n and w n the quadrature weights for the different 
angular integrations, (,..|...) is a Clebsch-Gordan coefficient, Nj m the (positive) 
normalization factor for the spherical harmonic Yj m , Pj n is an associated Leg- 
endre function, \i the center of mass BC to center of mass AB to B angle, \2 
the inclination angle for the second molecule, and (f> is the dihedral angle. If 
is zero, then only the functions with x — cos are required. For Q greater than 
zero, the two matrix multiplications are for cosine terms times cosine terms and 
sine terms times sine terms. The weighting of the various schemes changes some- 
what for this case because of the relative complexity of the analytic calculation 
of the angular integrals required for schemes A and B. Although these integrals 
are still independent of r, their calculation can be time-consuming enough that 
they can consume a significant fraction of the computational resources required 
for a scattering calculation. The analytic formulas involve 3 -j , 6-j and 9 -j sym- 
bols which are generated mostly in scalar mode. In contrast, scheme C, which 
only requires the functions of (V.32), involves only a Clebsch-Gordan coefficient 
(which is a rephased and normalized 3 -j symbol), thus the time to calculate the 
functions for scheme C will be much less than time to form the angular integrals 
required for the other two schemes. In addition, the efficient implementation of 
adiabatic rotational states possible with scheme C makes it rather attractive. 

An alternative scheme to reduce the operations required to evaluate the 
D matrix is to employ some form of discrete variable representation [141]. Here 
a transformation is made so that the potential coupling is diagonal with elements 
equal to the potential at grid points. The kinetic energy is not diagonal in this 
basis. So far this method has been applied only to rotational-like degrees of free- 
dom for three atom systems. Another way to reduce the effort to evaluate the po- 
tential matrix is to write the interaction potential in a special form whereby much 
of the angular and vibrational integrations need be done only once [122]. 

We also reiterate that the potential can be reused for calculations at 
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more than one energy, and if the body-frame matrix elements are used as dis- 
cussed above, they can be used for more than one value of the total angular mo- 
mentum J . Finally it may be possible to achieve speedups by evaluating the po- 
tential matrix using a coarser grid then is used to integrate (V .3) and then inter- 
polating the result. Thus if one is willing to perform extensive enough calcula- 
tions (many energies, many J), the evaluation of the potential matrix will be an 
insignificant part of the overall calculation, and the overall execution rate is deter- 
mined by the rate of the matrix manipulations in the algorithm used for solving 

(V.3). 

We now turn to some new algorithms and methods which have been 
made viable by the availability of the large CRAY-2 memory. In contrast to the 
quantum mechanical methods discussed so far, which propagate a solution along 
r, and hence need only store the relevant data for one value of r, the methods 
we mention now are more global in nature and require data for all values of r. 

The first method we discuss is the finite difference boundary value method (FD- 
BVM) [142,143]. Here the close-coupling equations are solved not by specifying 
two sets of boundary conditions at small r, but rather by specifying a boundary 
condition at both small and large r. This can be done in many ways, and the 
easiest is to set up a grid of r consisting of the points r i, r 2 , ... an d then 

approximate the derivative operator in (V.3) by finite differences, for example 

0-0 = £ c « f ( r >)- ( v - 33 ) 

j 

This converts the differential equation into a set of linear equations of the form 

Af = /?, (V.34) 

with A a band matrix made up of the D(r*) and Cq, f the vector made up of a 
particular column of f, say column n Q , at each of the N gr id distances, and (3 con- 
sists of values of column n 0 of f for distances less than rq and greater than r^ grid . 
For r less than rq we set f to zero. For r greater than rN grid , the boundary condi- 
tions are more complex, because they require knowledge of the scattering matrix, 
which is not available at this step of the calculation. Thus we are restricted to 
finite difference approximations which require knowledge of the solution only at 
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one distance beyond r^ grid . Then an arbitrary normalization can be introduced 
into (V.7) and we can take 

fnn 0 ( r N gr id + 1) = ^nn 0 - (V.35) 

The bandwidth of A depends on the number of points in the finite difference ap- 
proximation, and in our calculations we have found it efficient to use a 9-point ap- 
proximation everywhere except for the large r end of the grid where the number 
of points is reduced to seven and then eventually down to three in order that only 
one point beyond the end of the grid is required for the boundary conditions. Of 
course this necessitates using a smaller step size for the last grid points in order 
that the error of the solution not be dominated by the error in the final 3-point 
approximation [144], With this scheme the half-bandwidth of A is equal to 4 N , 
where N is the number of channels included in the close-coupling equations. 

There are several advantages of this method which offset to a certain de- 
gree the extra storage resources required by it. Perhaps the most important fea- 
ture is the high quality of radial functions produced. In contrast to initial value 
methods which directly determine the radial functions, the FDBVM has no dif- 
ficulties with instability due to classically forbidden regions. Thus if rather than 
just requiring the scattering matrix, one is interested in using the radial functions 
for other purposes, such as in integrals, then the FDBVM may be the method of 
choice. An important factor in making this worthwhile is the fact that there is no 
restriction on the grid points, that is, they can be unevenly spaced. We have used 
this feature to advantage by including Gaussian quadrature points in the finite 
difference grid and then used the resulting radial functions in numerical integrals 
using efficient Gaussian quadrature rules [144]. Another feature of the method is 
that it is very easy to solve inhomogeneous versions of (V.3), because the inho- 
mogeneity will simply appear in (3. In addition, because the work to solve (V.34) 
is primarily made up of forming the LU decomposition of A, then it is possible 
to solve for several inhomogeneities for little extra cost once the homogeneous 
problem has been solved [145]. The final advantage is that (V.34) can be solved 
by reasonably efficient black box programs, so that good execution rates can be 
obtained even for relatively small TV, in contrast to the step by step methods dis- 
cussed above. Another aspect of this is that most of the time will be spent solv- 
ing one set of linear equations, so the user need not be concerned with optimizing 
large amounts of code, but simply calling the library routine. 
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We next turn to methods which are not based on (V.3) but rather on 
equivalent integral equations. The number of methods here is very large [146- 
148], but the basic ideas are similar. Rather than numerically obtaining the ra- 
dial functions by integrating a differential equation, the radial functions (or re- 
lated quantities) are expanded in terms of known basis functions, and then the 
unknown coefficients are determined from the solution of a single set of linear 
equations, or the scattering matrix and related quantities are determined directly 
from matrix elements over the basis functions using linear algebra. Thus the work 
in forming the scattering matrix can be broken down into two parts, the calcula- 
tion of the matrix elements and the solution of the linear equations. If there are 
an average of M basis functions per each of the N channels, then for large enough 
M • JV, the work for the matrix element calculation will scale as (M • N) 2 while 
the work for the linear equations step will scale as (M • N) 3 . Thus rather then 
performing many matrix operations which scale as iV 3 , there is one large opera- 
tion scaling as (M ■ N ) 3 , so the size of M 3 compared to the number of integration 
steps will be important in determining the relative efficiency of these methods 
compared to the methods based on differential equations. However, these basis 
function methods have several advantages which can counterbalance inefficiencies 
and make them the methods of choice. A particular example is in the area of re- 
arrangement collisions. The quantum mechanical equations of motion based on 
the differential equation approach lead in this case to integro-differential equa- 
tions [149], which are very difficult to solve. Although it is possible by a suitable 
choice of coordinates to turn the rearrangement equations of motion into differ- 
ential equations [150], new complexities arise [151,152]. In contrast, the appli- 
cation of basis function methods leads only to exchange integrals which involve 
all known functions, so the methodology is virtually unchanged [144]. Another 
potential advantage is that the linear equations can be solved iteratively, so the 
work per initial state will scale as m(M ■ N ) 2 , where m is the number of iterations 
required [153-155]. This is an advantage when one is interested in transitions out 
of only a few initial states, so only a few columns of the scattering matrix are 
needed; however, all of the methods which we have discussed so far use arbitrary 
boundary conditions and hence require the generation of N linearly independent 
solutions in order to determine any column of the scattering matrix. Another 
potential advantage is based on the way scattering calculations are usually per- 
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formed. In order to assess the convergence of the calculations, it is necessary to 
perform a sequence of calculations which differ in the number of basis functions 
used, and thus many problems which are very similar are solved. What we wish 
to do is use some of the information from smaller, preliminary calculations, to 
make the larger calculations less expensive. One way to do this is to form linear 
combinations of the M basis functions based upon smaller calculations and use 
a smaller number of these contracted functions in the linear equation step [156]. 
Even if the number of basis functions per channel can be reduced only slightly, 
the cubic operation count will magnify this reduction and make it a worthwhile 
step. The parallels here with improving the basis functions in electronic structure 
calculations (such as the use of atomic natural orbitals) are striking. 

These basis function methods are still quite new for applications to heavy 
particle collisions, and are undergoing rapid development. One important area 
which needs further work is the efficient evaluation of the matrix elements. While 
the scaling arguments presented above show that this will be a negligible step for 
large enough calculations, experience to date is that the coefficient multiplying 
the (M • N) 2 factor is large enough so that it can dominate the calculation. How- 
ever, the best implementation here is probably yet to appear. 

All of this discussion relies on the assumption that the resources for the 
calculations exist, and this means primarily memory. If a calculation based on the 
differential equation approach, which only requires the storage of matrices of or- 
der N 2 , taxes the memory of a machine, clearly the basis function approach is not 
practical. For this reason, almost all converged calculations of three-dimensional 
heavy particle collisions have been performed on the CRAY-2. 

We now turn to the final method to be considered. Here we move from 
the time-independent Schrodinger equation to the time-dependent form. This in- 
troduces many new complexities into the calculations, but at the same time sev- 
eral advantages appear. The basic idea is that the time-dependent Schrodinger 
equation is first order in time, so the initial conditions determine the solution for 
all other times [157]. This in turn requires that the calculations are for a partic- 
ular initial quantum state, in contrast to most of the methods discussed above. 
The initial condition consists of a mixed basis function/grid representation of the 
wave function, and the initial translational energy range is determined via the 
uncertainty principle from the spatial localization of the wave function. The solu- 
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tion is propagated forward in time until after the collision is over and the results 
are analyzed to determine the probabilities for the various final states. These cal- 
culations are both computationally intensive and consume considerable storage 
resources, so most large-scale calculations have been carried out on the CRAY-2. 
There are three main advantages of the method. First, as mentioned above, the 
method only considers a single initial state. Thus the scaling of the operations 
of the method will be less than the third power, unlike the previous methods dis- 
cussed, and so for large enough calculations this method will be more efficient. 
Second, since the initial conditions consist of a spread of initial energies, it is pos- 
sible to obtain results for many different energies from a single time propagation 
run. Finally, since the time-dependent equation is solved and one is explicitly 
dealing with wave packets, the physical interpretation of the results can bring 
more insight than for the time-independent methods. 

C. The influence of supercomputers on dynamics calculations. 

Relatively few classical dynamics studies utilizing vectorized codes run- 
ning on Cray computers have been reported in the literature, mainly because the 
calculations can be carried out on slower machines. Indeed, the time per trajec- 
tory is small enough that even personal computers have been used. What con- 
tributes to the overall time is the number of trajectories required, and the advan- 
tage of a vectorized code is that results can be obtained in much less real time 
(perhaps days rather than months). This offers the opportunity for much better 
feedback to others whose work depends on the outcome of the dynamics, just as 
discussed in section IV for the electronic structure case. 

In the area of quantum dynamics, the advent of supercomputers has 
opened entire new horizons. The available processing power is beginning to make 
the study of non-reactive collisions in systems of more than three atoms feasible; 
this will allow the investigation of phenomena like vibrational-to-vibrational en- 
ergy transfer, a process which cannot occur for less than four atoms. The large 
memory of the CRAY-2 has been the main impetus behind a renaissance in quan- 
tum reactive scattering, for the application of the basis function methods that 
have proved so fruitful would not be practical on smaller machines. 
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VI. Conclusions and future directions 


We have reviewed the performance of essentially all the currently avail- 
able Cray computers on typical tasks in quantum chemistry and dynamics. Our 
discussions show that it is possible to achieve a large fraction of the possible ma- 
chine performance in all phases of these calculations, and we have also consid- 
ered the impact these developments have had on research in quantum chemistry 
and dynamics. In view of the fact that one consequence of an increase in avail- 
able computing power is to whet researchers’ appetites for even more computing 
power, we shall discuss briefly here some future directions. 

The most important development in the next few years is likely to be a 
much greater use of multitasking on Cray computers. As we have discussed here 
and elsewhere [28], multitasking will probably be a necessity to ensure that all 
CPUs are kept busy in multiple CPU systems, and it should also be used when 
a single job needs access to a large fraction of system resources. The arrival of 
microtasked library subroutines should encourage more multitasking, as will the 
incorporation of automatic multitasked code generation into the CFT77 compiler, 
but some burden will fall on the programmer if coarse-grained parallelism is to be 
exploited. There has been little work reported investigating the use of multitask- 
ing on normal production machines: our own experiences suggest that some work 
remains to be done at the operating system level in implementing multitasking. 
Given that there is the potential on the Y-MP to achieve close to 2.5 GFLOPS 
using all eight CPUs, as discussed in section HID, the rewards from multitasking 
can be considerable, and we can expect to see much work in this area in the near 
future. 

Another important development will be the arrival of new supercomputer 
models. The CRAY-3, featuring 16 CPUs and a clock period of 2 ns and using 
gallium arsenide technology, has been described in some detail [158]. The theoret- 
ical maximum performance would be 1 GFLOPS per CPU, or 16 GFLOPS using 
all CPUs together. This machine is an obvious development from the CRAY-2, 
and much of the experience with the latter should carry over to the CRAY-3. 
There is already discussion of the CRAY-4 [158], which is planned to out-perform 
the original CRAY-1 by a factor of 1 000 — three orders of magnitude in some 
twenty years. These new machines will have memories of 1 GW and larger, which 
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will stimulate not only the wider use of some of the large-memory-based algo- 
rithms we have described in sections IV and V, but also the development of new 
schemes that can exploit even larger memories. This will undoubtedly lead to 
novel and exotic approaches to many problems, but the more commonplace meth- 
ods described earlier will also see substantial performance benefits from the newer 
machines. It should be noted, however, that the size of electronic structure prob- 
lems that can be tackled using conventional methods is commonly limited by ex- 
ternal storage capacity even on the current range of Cray computers, and this 
limitation will only become more acute as faster machines appear. While meth- 
ods designed to circumvent this limitation will undoubtedly be developed, it is 
very likely that providing adequate external storage will be a major problem for 
supercomputer vendors in the next decade. 

Having reviewed developments in hardware and software for computa- 
tional chemistry on Cray computers, a final conclusion can be drawn. The power 
of Cray supercomputers, with their particular suitability for computational chem- 
istry, has stimulated many important and fundamental developments in the field, 
and as the next generations of Cray computers appear we can be confident that 
this trend will continue. There is therefore every reason for optimism about what 
the rather natural pairing of computational chemistry and Cray supercomputers 
will bring forth. 
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Table 1. SAXPY performance (in MFLOPS) on CRAY X-MP/48. 




Table 2. SAXPY performance (in MFLOPS) on Cray computers . 


N a X-MP/14se 6 X-MP/48 6 Y-MP/832* CRAY-2* C CRAY-2 C 

4 

11.1 

11.4 

15.2 

3.7 

3.3 

8 

22.1 

22.7 

30.3 

7.0 

6.5 

12 

33.2 

34.1 

45.5 

10.1 

9.5 

16 

44.2 

44.7 

60.6 

13.0 

12.1 

25 

69.0 

68.0 

94.5 

19.1 

18.8 

32 

84.8 

86.3 

120.9 

23.2 

22.7 

63 

112.9 

132.8 

211.5 

40.3 

38.1 

64 

103.3 

128.6 

213.5 

42.6 

37.6 

127 

104.4 

119.3 

213.1 

41.1 

53.4 

128 

103.9 

135.5 

214.1 

52.2 

50.8 

255 

119.6 

131.0 

242.1 

58.0 

67.8 

256 

104.0 

121.2 

242.5 

80.5 

62.2 

300 

120.5 

158.6 

178.7 

70.8 

66.2 

511 

121.7 

143.3 

206.8 

63.0 

45.1 

512 

120.3 

140.6 

206.5 

67.2 

63.1 

roo (ni/ 2 ) 

122 (22) 

159 (30) 

243 (32) 

81 (63) 

68 (55) 


° Vector length. 

6 CFT77 gives best performance. 
c SCILIB routine gives best performance. 


Table 3. Matrix multiplication performance (in MFLOPS) on CRAY X-MP/48. 


N a 

DOT 6 

SAXPY C 

4*unrolled d 

MXM e 

4 

2.2 

12.1 

9.6 

22.8 

8 

4.6 

24.3 

22.2 

72.4 

10 

5.7 

29.4 

27.0 

92.8 

12 

6.8 

34.1 

33.6 

111.4 

16 

9.0 

40.5 

42.9 

134.5 

25 

14.3 

56.8 

63.1 

159.9 

32 

17.2 

67.0 

75.0 

172.3 

50 

25.8 

80.4 

98.0 

186.2 

63 

33.3 

91.5 

103.8 

190.8 

64 

22.5 

91.7 

107.0 

191.3 

75 

31.7 

87.1 

96.5 

180.1 

100 

38.6 

98.5 

118.2 

189.9 

127 

53.9 

113.3 

127.5 

194.2 

128 

30.2 

110.8 

125.0 

194.3 

175 

70.5 

120.1 

131.0 

194.0 

255 

82.5 

116.2 

131.9 

195.8 

256 

39.8 

117.9 

135.3 

195.8 

400 

83.0 

122.2 

137.8 

194.8 

511 

103.5 

124.3 

139.5 

196.5 

512 

40.8 

126.3 

143.7 

196.6 

roo (rci/ 2 ) 

104 (127) 

126 (127) 

144 (31) 

197 (11) 


a Vector length. 

b Dot product inner loop; CFT77. 
c SAXPY inner loop; CFT77. 
d Unrolled to a depth of four (see text); CFT77. 
e SCILIB routine. 


Table 4. Matrix multiplication performance (in MFLOPS) on Cray computers. 


N a 

X-MP/14se 

X-MP/48 

Y-MP/832 

CRAY-2* 

CRAY-2 

4 

21.8 

22.8 

32.1 

19.7 

18.5 

8 

68.0 

72.4 

101.8 

63.1 

58.5 

10 

88.1 

92.8 

133.2 

86.4 

82.2 

12 

105.0 

111.4 

159.7 

104.8 

98.2 

16 

130.8 

134.5 

201.7 

146.5 

132.1 

25 

153.9 

159.9 

239.3 

203.6 

178.1 

32 

164.3 

172.3 

256.8 

249.4 

223.4 

50 

177.1 

186.2 

277.6 

312.8 

301.3 

63 

181.8 

190.8 

285.3 

339.6 

330.6 

64 

181.7 

191.3 

285.8 

361.7 

329.6 

75 

171.9 

180.1 

268.4 

288.2 

255.5 

100 

180.6 

189.9 

283.1 

334.6 

295.5 

127 

184.6 

194.2 

289.8 

380.3 

337.9 

128 

184.7 

194.3 

290.0 

361.8 

330.9 

175 

184.3 

194.0 

289.1 

337.6 

321.5 

255 

186.1 

195.8 

292.0 

382.9 

235.3 

256 

186.1 

195.8 

292.1 

337.9 

256.7 

400 

185.1 

194.8 

290.4 

340.3 

264.5 

511 

186.8 

196.5 

293.1 

379.0 

238.6 

512 

186.1 

196.6 

293.1 

342.9 

241.2 

foo (n 1/2 ) 

187 (10) 

197 (11) 

293 (11) 

383 (33) 

338 (22) 


a Vector length. 



Table 5. Performance of specialized matrix multiplication techniques (in 
MFLOPS). 


N a 

CRAY X-MP/48 


CRAY-2* 


MXM 6 

8*unrolled c 

MXM* 

8*unrolled c 

Strassen d 

MXMPMA £ 

16 

136 

52 

111 

56 

93 

110 

63 

191 

95 

319 

110 

364 

370 

64 

191 

103 

285 

168 

367 

373 

127 

194 

111 

289 

187 

385 

386 

128 

194 

112 

381 

225 

383 

378 

255 

196 

112 

259 

188 

386 

387 

256 

196 

113 

304 

233 

385 

383 

511 

197 

117 

387 

242 

407 

387 

512 

197 

117 

328 

259 

423 

387 


a Vector length. 
b SCILIB routine. 

c Unrolled to a depth of eight (see text); CFT2. 
d Strassen algorithm, (Bailey, Ref. [27]). 
e Calahan et al ., Ref. [23]. 



Table 6. Performance (in MFLOPs) for different operations®. 


Operation 

X-MP/14se 

X-MP/48 

Y-MP/832 

CRAY-2 

CRAY-2* 

Vector add 

60 (24) 

70 (24) 

119 (28) 

39 (63) L 

48 (72) L 

Dot product 

90 (96) 

95 (91) 

148 (102) 

80 (132) L 

103 (96) L 

SAXPY 

122 (22) 

159 (30) 

254 (31) 

68 (55) L 

101 (128) L 

Vector divide 6 

25 (16) 

26 (15) 

41 (16) 

24 (23) 

26 (26) 

MXV C 

187 (17) L 

195 (18) L 

312 (19) L 

242 (27) L 

294 (27) L 

MXM 1 * 

187 (10) L 

197 (11) L 

308 (12) L 

338 (22) L 

383 (33) L 

Cubic 6 

168 (17) 

174 (15) 

288 (19) 

113 (15) 

121 (17) 

Sextic 6 

178 (14) 

187 (13) 

297 (16) 

142 (17) 

148 (17) 

[2/1]^ 

104 (13) 

109 (12) 

170 (13) 

116 (16) 

118 (17) 

[3/2]/ 

125 (11) 

132 (11) 

207 (12) 

154 (16) 

158 (17) 

square root 6 

11 (15) 

12 (16) 

14 (15) 

23 (21) 

24 (19) 

sine 6 

3(9) 

4(10) 

5(11) 

6(13) 

6(16) 

arctangent 6 

5(11) 

5(11) 

8(13) 

6(16) 

6(15) 

exponential 6 

6(12) 

6(12) 

9(13) 

9 (25) 

9 (25) 

He 6 

4(11) 

4(11) 

6 (11) 

6(16) 

6(15) 

MOVE 9 

76 (23) 

83 (25) 

143 (30) 

51 (37) 

72 (47) 

GATHER' 1 

26 (10) 

46 (14) 

80 (19) 

17 (22) 

20 (25) 

SCATTER' 1 

32 (11) 

45 (13) 

88 (20) 

20 (16) 

24 (16) 

SPDOT* 

32 (67) L 

34 (64) L 

52 (92) L 

40 (230) L 

48 (192) L 

SPAXPY* 

44 (58) L 

47 (55) L 

77 (66) L 

31 (63) L 

42 (96) L 


a L denotes SCILIB routine, otherwise CFT77. n 1 / 2 values in parentheses. 
6 Performance in megaresults per second. 
c Matrix-vector product. 
d Matrix multiplication. 

e Polynomial of given order with vector of arguments. 
f Rational fraction of given order with vector of arguments. 

9 Vector move, performance in MW/s. 
h Performance in MW /s. 

* Sparse vector operations (see text). 


Table 7. Macrotasked matrix multiplication performance (in MFLOPS) a . 



1 CPU 

2 CPUs 

4 CPUs 

8 CPUs 

X-MP/48 

200 

399 

796 


CRAY-2 

420 

756 

1216 


Y-MP/832 

293 

585 

1166 

2320 


a Obtained in stand-alone mode on X-MP/48 and CRAY-2. In stand-alone mode 
the Y-MP rates would be some 3-4% higher. 


Table 8. Gaussian integrals and SCF timings for N 2 on CRAY X-MP/48 a (in seconds). 


D b 
U 2 h 

D2h 


c 5 

Ci 

Njnt Time 

Njnt Time 

Njnt Time 

Njnt Time 

Njnt Time 


Integrals 

6230657 

407.8 

1837575 

472.4 

3426083 

477.3 

8416943 

955.6 

15741082 

1676.3 

Symmetry 

aaaa 

a(3a(3 

aa(3fi 

a(3j6 

Nord 

509026 

3411984 

940545 

3324672 

Time 

0.5 

7.9 / 

1.7 

10.2^ 

Nord 

162710 

1223503 

347383 

1321917 

Ordering of symmetry integrals d 
Time Nord Time Nord 

0.2 1237527 0.6 6847625 

1.2 3718896 6.1-^ 7840000 

0.7 995841 1.2 2037700 

2.1 1249248 1.3 

Time 

12.6-f 

14.9-f 

5.0' 

Nord 

37271025 

Time 
62. 9-^ 

Total 

8186227 

20.4 

3055513 

4.3 

7201512 

9.3 

16725325 

32.5 

37271025 

62.9 






SCF calculation 







Iter ft 

F 

Iter 

F 

Iter 

F 

Iter 

F 

Iter 


0.4 

0.6 

0.2 

0.4 

0.4 

0.7 

0.6 

1.2 

1.0 

2.2 


a [53 4 p 3 d 2/ 1 g] basis; spherical harmonic functions used except where indicated. 

b Cartesian basis functions. 

c Number of non- zero integrals computed. 

d Ordering performed in memory unless otherwise specified. 

e Number of ordered integrals generated. 

f Ordering uses direct access storage (SSD). 

9 Closed-shell Fock matrix construction. 
h Closed-shell SCF iteration time. 


Table 9. Integral, SCF and transformation timings for N 2 (in seconds)". 




X-MP/48 CRAY-2* 

Y-MP/832 

Integrals 

Njnt 

1837575 

Integral calculation 
472.4 

435.8 

210.8 

Symmetry 

aa.cx.ot. 

Nord 

162710 

Integral ordering 
0.2 

0.4 

0.1 


1223503 

1.2 

2.7 

0.6 

aoi/3j3 

347383 

0.7 

1.5 

0.4 

aj3j 6 

1321917 

2.1 

5.2 

1.2 

Total 

3055513 

4.3 

9.9 

2.3 

Fock matrix d 


SCF calculation 
0.2 

1.3 

0.2 

SCF iter 6 


0.4 

1.4 

0.3 

Transformation 

Integral transformation-^ 
8.6 

8.3 

6.6 


“ [5s 4 p 3 d 2/ 1 g] spherical harmonic basis. 

b Number of non-zero integrals computed in D 2/1 symmetry. 

c Number of ordered integrals of each symmetry type. All ordering is done in 

memory. 

d Closed-shell Fock matrix construction. 
e Closed-shell SCF iteration. 
f 1 a g and lcr u MOs frozen in transformation. 


Table 10. Cl timings for N 2 on CRAY X-MP/48(in seconds)®. 


Calculation 

Time 

Formula tape for 16 internals 6 

<0.1 

Direct Cl iteration (17 626 CSFs) 

0.9 

Formula tape for 2804 internals 0 

12.5 

Direct Cl iteration (729 950 CSFs) 

79.7 

Processing ( ai\bj ) integrals 

38.3 

Processing (ai\jk) integrals 

18.1 

Processing (ab\cd) integrals 

10.2 

a [5s 4 p 3 d 2/ 1 g] basis: l<y g and 1 a u MOs frozen. 
b Single reference CSF, 10 electrons correlated. 
c CAS reference space (6 active electrons in 6 orbitals 

— 32 CSFs), 10 electrons 

correlated. 



Table 11. Execution times and rates for classical trajectories. 



Fraction of total time 



MFLOPS 




X-MP/48 0 

Y-MP/832 i 

CRAY-2 C 

CRAY-2* C 

X-MP/48 

Y-MP/832 

CRAY-2 

CRAY-2* 




f+h 2 






Gradients of Potential 

0.36 

0.39 

0.29 

0.31 

130 

190 

160 

180 

Build Q n and Chain rule 

0.27 

0.24 

0.20 

0.20 

120 

210 

160 

190 

sub total 

0.63 

0.64 

0.49 

0.51 

126 

200 

160 

184 

Integrator 

0.37 

0.36 

0.51 

0.49 

55 

93 

40 

50 

overall 





100 

160 

99 

118 




h 2 +h 2 






Gradients of Potential 

0.94 

0.94 

0.94 

0.94 

110 

190 

100 

120 

Build Q n and Chain rule 

0.03 

0.03 

0.02 

0.03 

140 

210 

140 

160 

sub total 

0.97 

0.97 

0.96 

0.97 

111 

190 

101 

121 

Integrator 

0.03 

0.03 

0.04 

0.03 

60 

110 

50 

50 

overall 





110 

190 

98 

119 


a Using CFT compiler. The CFTT7 compiler produces bad code for this problem. 
b Using CFT 77 compiler. 

c Using CFT77 compiler. Due to the presence of some features of FORTRAN 77 which are only available under CFT77, 
we do not use CFT2. 



Table 12. Characterization of the execution rates for matrix operations. 






Routine 



Machine 

MXM a 

RS 6 LUSOLV c 

LUSOLV^ 

MINV SGEF A/SL e 

SSPFA/SL' 

X-MP/48 

200 5 

170 

190 

210 

73 

130 

73 

2 

150 

110 

120 

0 

200 

470 

Y-MP/832 

290 

240 

. . . 

300 

120 

250 

120 


1 

100 

• • ♦ 

no 

7 

150 

470 

CRAY-2 

310 

120 

170 

240 

340 

120 

43 


4 

25 

160 

290 

80 

140 

330 

CRAY-2* 

370 

140 

210 

200 

360 

140 

50 


1 

30 

160 

150 

60 

140 

280 


a matrix multiply routine from SCILIB. 

b EISPACK [57] routine for real symmetric eigenvalues and eigenvectors from 


SCILIB. 

c FORTRAN program for general linear equation solution [129,130] compiled us- 
ing CFT2. 

d FORTRAN program for general linear equation solution compiled using CFT77 . 
e LINPACK [131] program for general linear equation solution from SCILIB. 
f LINPACK program for symmetric linear equation solution from SCILIB. 

9 Upper entry: estimate of asymptotic execution rate in MFLOPS, lower entry: 
fitted estimate (see text) of matrix order required to achieve half of the asymp- 
totic rate. 


Abstract 


The influence of recent developments in supercomputing on computa- 
tional chemistry is discussed with particular reference to Cray computers and 
their pipelined vector /limited parallel architectures. After reviewing Cray hard- 
ware and software we. examine the performance of different elementary program,^ u+ (_ ; 
structures iScwiiS&'effective methods for improving program performanceJ-We- 


then-diseuss' the computational strategies appropriate for obtaining optimum per 
formance in applications to quantum chemistry and dynamics^Finally, some dis- 
cussion is given of new developments and future hardware and software improve 
ments. 
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